ABSTRACT 

Simulated data were used to investigate the 
performance of modified versions of the Mantel-Haenszel and 
standardization methods of differential item functioning (DIF) 
analysis in computer-adaptive tests (CATs). Each "examinee" received 
25 items out of a 75-item pool. A three-parameter logistic item 
response model was assumed, and examinees were matched on expected 
true scores based on their CAT responses and on estimated item 
parameters. Both DIF methods performed well. The CAT-based DIF 
statistics were highly correlated with DIF statistics based on 
nonadaptive administration of all 75 pool items and with the true 
magnitudes of DIF in the simulation. DIF methods were also 
investigated for "pretest items," for which item parameter estimates 
were assumed to be unavailable. The pretest DIF statistics were 
generally well-behaved and also had high correlations with the true 
DIF. The pretest DIF measures, however, tended to be slightly smaller 
in magnitude than their CAT-based counterparts. Also, in the case of 
the Mantel-Haenszel approach, the pretest DIF statistics tended to 
have somewhat larger standard errors than the CAT-DIF statistics. 
Appendix A contains 10 supplementary tables; and Appendixes B, C, and 
D present additional information about the expected table estimator. 
Twenty-two tables in Appendix D present analysis results. (Contains 
24 references.) (Author/SLD) 

1. Overview 

Many large-scale testing programs are now developing or piloting computer-adaptive 
tests (CATs). Among these are the Scholastic Aptitude Test (SAT), the Graduate Record 
Examinations (GRE), and Praxis (successor to the NTE teacher assessment), developed at 
Educational Testing Service (ETS), the COMPASS placement tests produced by the American 
College Testing Program, the College Board Computerized Placement Tests, the Differential 
Aptitude Tests published by the Psychological Corporation, and the Armed Services 
Vocational Aptitude Battery (ASVAB). The item responses collected from an examinee in a 
CAT may be a small fraction of the data that would have been collected in a corresponding 
nonadaptive test. Furthermore, the items received by each examinee are a nonrandom subset 
of the available pool of items. The introduction of CATs requires that new approaches be 
developed for assessing validity and reliability and for analyzing item properties, including 
differential item functioning (DIF). 

The purpose of our project was to investigate whether existing DIF analysis methods 
could be modified to accommodate the data collected in a CAT. There are several reasons 
that DIF detection may be more important for CATs than it is for nonadaptive tests. First, 
because fewer items are administered in a CAT, each item response plays a more important 
role in the examinee's test score than it would in a nonadaptive testing format. Any flaw in 
an item, therefore, may be more consequential for the examinee. Second, item difficulty and 
DIF have been found to be positively related to an appreciable degree for some pairs of 
populations (e.g., Kulick & Hu, 1989). Therefore, if the group of primary interest (the focal 
group) scores substantially below the comparison, or reference, group, the CATs encountered 
by the focal group members will be made up of easier items than the CATs encountered by 
reference group members. If easier items have, on average, more negative DIF (i.e., DIF 
disadvantaging the focal group) than harder items, then the scores of focal group members 
may be lower than they should be and even lower than they would be on a comparable 
nonadaptive version of the test (Holland & Zwick, 1991). Finally, administration of a test by 
computer creates several potential sources of DIF that are not present in conventional tests, 
such as differential computer familiarity, facility, and anxiety, and differential preferences for 
computerized administration. Legg and Buhr (1992) and Schaeffer, Reese, and Steffen (1992) 
both report ethnic and gender group differences in some of these attributes. Their findings 
suggest that attitudes toward computer testing may be surprisingly complex. For example, 
Schaeffer, Reese, and Steffen (1992) found that Asian test-takers were most likely to have a 
computer available at home and most likely to report that using the computer mouse was very 
easy. Yet both Schaeffer et al. and Legg and Buhr found that Asian examinees were more 
likely than any other ethnic group to state that they preferred paper-and-pencil to 
computerized administration. 

To investigate DIF detection in CATs, we simulated data consisting of responses to 
three different pools of 75 items. In Pool 1, the items had no DIF, in Pool 2, the items had 
DIF that was uncorrelated with item difficulty, and in Pool 3, the items had DIF that was 
correlated with item difficulty. The only kind of DIF that was studied was a difference in 
item difficulty for the reference and focal groups, often called uniform DIF. The distance 
between reference and focal group means and the sample sizes for the two groups were 
varied, as was the DIF status of the items and the item difficulties and discriminations. 

Using a CAT algorithm based on item information, each "examinee" was assigned 25 
items from one of the three pools of 75 items. Responses to the selected items were 
generated using the three-parameter logistic (3PL) item response theory model. The 
maximum likelihood estimate (MLE) of the examinee's ability was recomputed after each 
item was administered and the next item selected was the most informative item at the 
examinee's estimated ability. 

The simulated data were used to investigate the feasibility of conducting DIF analyses 
using modified versions of the Mantel-Haenszel (MH; 1959) approach of Holland and Thayer 
(1988) and the standardization method of Dorans and Kulick (1986). Examinees were 
matched on the expected true score for the entire 75-item pool, computed using estimated 
ability from the 25 CAT items and estimated item parameters. An approach of this kind was 
suggested by Steinberg, Thissen, and Wainer (1990). 
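
As a minimal sketch of this matching variable (not the program used in the study), the expected true score can be computed by summing the 3PL response probabilities, evaluated at the CAT-based ability estimate, over all pool items. The function names and the three-item pool below are hypothetical; the study used estimated parameters for 75 items.

```python
import math


def p_3pl(theta, a, b, c):
    """3PL probability of a correct response, with the 1.7 scaling factor."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))


def expected_true_score(theta_hat, items):
    """Sum of estimated response probabilities over every item in the pool."""
    return sum(p_3pl(theta_hat, a, b, c) for a, b, c in items)


# Hypothetical pool of (a, b, c) estimates; values are for illustration only.
pool = [(1.0, -0.65, 0.15), (0.74, 0.0, 0.15), (1.0, 0.65, 0.15)]
score = expected_true_score(0.0, pool)
higher_score = expected_true_score(1.0, pool)
```

Because each response function increases with ability, a higher ability estimate always yields a higher expected true score, so the matching variable orders examinees in the same way as the ability estimate itself.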

In addition, DIF analyses were conducted for "pretest" items that were administered 
nonadaptively. All examinees received the same set of pretest items, along with the CAT. 
For DIF analyses of the pretest items, the matching variable was the sum of the expected true 
score based on the CAT responses and the score (0 or 1) on the item under analysis, referred 
to as the studied item. 

To disentangle the effects of assigning items via the CAT algorithm on one hand and 
matching examinees on expected true score on the other, we also included, for some 
simulation conditions, a "nonadaptive control" analysis in which the matching variable for 
DIF analysis was the expected true score computed with the MLE estimated from responses 
to all 75 pool items. The results of this analysis were compared to the results obtained by 
matching on the CAT-based expected true score and to results obtained by matching on 
number-right score, as in conventional MH and standardization analysis. 

The CAT-based DIF statistics were found to be highly correlated with true DIF and 
with DIF measures based on nonadaptive administration. Furthermore, the mean DIF 
statistics for each pool were close to their nominal value of zero. Although Pool 3 DIF 
statistics were not quite as well-behaved as the Pool 2 statistics, our results, in general, appear 
to provide good news for testing programs that wish to establish DIF screening procedures for 
CATs. In the case of the pretest items, the DIF statistics also appeared to be well-behaved. 
However, the standard errors of the Mantel-Haenszel DIF statistics tended to be larger than in 
the CAT, reducing the power to detect DIF. 

2. Simulation procedures 

Our principle in developing the simulation design was to aim for some reasonable 
compromise between an approach that was realistic (in that it mimicked the properties of an 
actual CAT) and one that was simple enough to yield useful, interpretable results. In 
designing the simulation, we consulted with staff from ETS testing programs to ensure that 
our decisions were likely to produce data that were substantially consistent with actual ETS 
test results. The design of the simulation had three main components: determination of the 
"administration" conditions, definition of the properties of the simulated CAT, and 
specification of the parameters of the CAT pool items and pretest items. These components 
are described in the following sections. 

2.1 Administration conditions 

Eighteen data sets were created, each corresponding to a CAT administration. The 
administrations were defined by the properties of the item pool, the ability distributions of the 
reference and focal groups, and the group sample sizes. These factors are described below. 
The number of levels of the three factors was 3, 3, and 2, respectively, resulting in 18 distinct 
data sets, the properties of which are summarized in Table 1. 

Insert Table 1 about here. 

Item pool: Three item pools were included. Pool 1 had no DIF; its purpose was to allow 
investigation of the functioning of the DIF methods in the null case. Any conclusion 
of DIF for this pool would constitute a Type I error. Two types of DIF pools were 
included: Pool 2 had DIF that was uncorrelated with item difficulty, and Pool 3 had 
DIF that was positively correlated with item difficulty. Research has found that, for 
some pairs of ethnic groups, DIF tends to be positively correlated with item difficulty, 
whereas for male-female analyses, this tends not to be true (e.g., Kulick & Hu, 1989). 
Pools 2 and 3 were created to allow investigation of the effect of this correlation. The 
item difficulty, discrimination, and guessing parameters were the same across all three 
pools of items; only the DIF properties varied. Details on the pattern of DIF are given 
in section 2.2. 

Focal group ability distribution: The three possible focal group distributions were N(-1, 1), 
N(0, 1), and N(+.5, 1). In each case, the reference group had a N(0, 1) distribution. 
The differences between reference and focal group means were chosen to be 
representative of group differences encountered in ETS DIF analyses. 

Group sample size conditions: Two sample size conditions were included: $n_R$ = 500, $n_F$ = 
500; and $n_R$ = 900, $n_F$ = 100, where $n_R$ and $n_F$ are the sample sizes for the reference 
and focal groups, respectively. Like the focal group distributions, these sample size 
conditions were chosen to be similar to those that occur in ETS analyses. 

2.2 CAT simulation 

In simulating the CAT data, item responses were generated based on the true item 
parameters, using the 3PL item response function, 

\[ P_j(\theta) = c_j + (1 - c_j)\left[1 + \exp\{-1.7\,a_j(\theta - b_{jG})\}\right]^{-1}, \qquad (1) \]

where $P_j(\theta)$ is the probability of answering item j correctly for examinees with ability $\theta$, $a_j$ 
and $c_j$ are the discrimination and guessing parameters, respectively, $b_{jG}$ is the difficulty in 
group G (G = reference or focal), and the factor of 1.7 is included to make the logistic scale 
into an approximate probit scale (Lord & Novick, 1968, p. 400). The focal group difficulty, 
$b_{jF}$, was obtained by adding the item $d_j$ value, which could be positive or negative, to the 
reference group difficulty, $b_{jR}$. A response was generated as correct if a random number 
drawn from a uniform distribution between 0 and 1 was less than the value of the item 
response function computed at the true ability. Otherwise the response was incorrect. 
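
The response-generation rule just described can be sketched as follows. This is an illustration under stated assumptions rather than the simulation program itself: the function names, the random seed, and the single item used in the example are hypothetical, and the group-specific difficulty is passed in directly.

```python
import math
import random


def p_3pl(theta, a, b_group, c):
    """3PL item response function with the 1.7 scaling factor."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b_group)))


def simulate_response(theta, a, b_group, c, rng):
    """Return 1 (correct) if a uniform(0, 1) draw is below P(theta), else 0."""
    return 1 if rng.random() < p_3pl(theta, a, b_group, c) else 0


rng = random.Random(12345)  # seed chosen only to make the sketch reproducible
# Hypothetical item (a = 1, difficulty 0, c = .15) and examinee (theta = 0).
responses = [simulate_response(0.0, 1.0, 0.0, 0.15, rng) for _ in range(10000)]
proportion_correct = sum(responses) / len(responses)
```

For this item and ability the response function equals .15 + .85/2 = .575, so the simulated proportion correct should fall near that value.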

The CAT simulation was designed as a simplified version of actual CATs being 
developed at ETS. The CAT algorithm selects as the next item to be administered the most 
informative item at the maximum likelihood estimate of ability computed from the items 
already administered.^ (Estimates of item information and examinee ability were computed 
using estimated item parameters, described in section 2.3.6). Most actual CATs under 
development at ETS select items on the basis of both information and other characteristics, 
such as item format and content. 

We based our study on a fixed-length CAT of 25 items. This is similar to the number 
of items in a single section of the SAT and GRE CATs. To determine the number of items 
in the CAT pool, we conducted trial CAT simulations to allow us to investigate patterns of 
item use for pools with various item properties. We considered pools of 75 and 100 items, 
and concluded that the 75-item pool was superior in that a higher percentage of the items 
were actually used. This ratio of items in the pool (75) to items administered per examinee 
(25) is smaller than in many real applications. However, using a larger pool would have 
meant a reduction in the percent of pool items that were administered. 

^The CAT algorithm was implemented in a revised version of a program written by 
Martha Stocking based on the approach of Lord (1976). The item information function is 
defined as $[P_j'(\theta)]^2 / [P_j(\theta) Q_j(\theta)]$, where $P_j(\theta)$ is the item response function (in this case, the 
3PL function defined in equation 1), $P_j'(\theta)$ is the first derivative of $P_j(\theta)$ with respect to $\theta$, 
and $Q_j(\theta) = 1 - P_j(\theta)$ (see Lord, 1980). 

Selection of the most informative item at the examinee's estimated ability was 
achieved using an item information table, shown in Table A-1 in Appendix A, that contains 
columns for equally spaced abilities from -2 to 2 at intervals of .2. Each column lists the 
item numbers sorted in descending order by the item information at that ability level. The 
table contains 25 rows, since each CAT consisted of only 25 out of the 75 pool items. (To 
allow additional analysis, examinee responses were also generated for all of the pool items 
not administered in the CAT.) 
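
A table of this kind could be built along the following lines. The sketch assumes the standard 3PL information function, the squared derivative of the response function divided by P(1 - P); the function names and the three-item pool are hypothetical, and the study's table had 25 rows and was based on estimated parameters for 75 items.

```python
import math


def p_3pl(theta, a, b, c):
    """3PL item response function with the 1.7 scaling factor."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))


def item_information(theta, a, b, c):
    """3PL item information, [P'(theta)]^2 / (P(theta) * (1 - P(theta)))."""
    p = p_3pl(theta, a, b, c)
    logistic = (p - c) / (1.0 - c)
    dp = 1.7 * a * (1.0 - c) * logistic * (1.0 - logistic)
    return dp * dp / (p * (1.0 - p))


def information_table(items, n_rows=25):
    """Map each ability level (-2 to 2, step .2) to item indices ranked by information."""
    levels = [round(-2.0 + 0.2 * k, 1) for k in range(21)]
    table = {}
    for theta in levels:
        ranked = sorted(range(len(items)),
                        key=lambda j: item_information(theta, *items[j]),
                        reverse=True)
        table[theta] = ranked[:n_rows]
    return table


# Hypothetical pool of (a, b, c) values, listed from easiest to hardest.
pool = [(1.0, -1.3, 0.15), (1.0, 0.0, 0.15), (0.74, 1.3, 0.15)]
table = information_table(pool, n_rows=2)
```

In this toy pool the easiest item heads the column at ability -2, the middle item heads the column at 0, and the hardest item heads the column at +2, which is the pattern the selection rule relies on.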

In a process similar to that used in actual CATs, the first item administered was 
randomly selected from the first four items in the column at ability zero. The second item 
was randomly selected from the first three items at either an ability of -2 or +2, depending on 
whether the first item was answered incorrectly or correctly, respectively. Examinees with 
all-incorrect or all-correct patterns after responding to item 2 continued to receive the most 
informative item (among those not yet administered) from the -2 or the +2 column, 
respectively. Once an examinee had both a right and a wrong answer, ability was reestimated 
by maximum likelihood following each item response. Each subsequent item was selected 
from the column of the information table which was closest to the examinee's estimated 
ability, calculated from responses to all items answered up to that point. The most 
informative item that had not already been given to that examinee was administered. 
Examinees that answered all CAT items incorrectly were assigned an ability estimate of -10. 
Examinees that answered all items correctly were given an ability estimate of 10. 
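
The ability-estimation step can be illustrated with a simple grid search over the 3PL log likelihood. The grid search, its bounds of -4 and 4, and the function names are assumptions made for this sketch; they stand in for, and are not a description of, the maximum likelihood routine actually used. Only the -10 and 10 assignments for all-incorrect and all-correct patterns come from the text.

```python
import math


def p_3pl(theta, a, b, c):
    """3PL item response function with the 1.7 scaling factor."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))


def mle_ability(responses, items, grid_step=0.01, lo=-4.0, hi=4.0):
    """Grid-search MLE of ability; -10 or 10 for all-incorrect or all-correct patterns."""
    if all(u == 0 for u in responses):
        return -10.0
    if all(u == 1 for u in responses):
        return 10.0
    best_theta, best_loglik = lo, -math.inf
    n_steps = int(round((hi - lo) / grid_step))
    for k in range(n_steps + 1):
        theta = lo + k * grid_step
        loglik = 0.0
        for u, (a, b, c) in zip(responses, items):
            p = p_3pl(theta, a, b, c)
            loglik += math.log(p) if u == 1 else math.log(1.0 - p)
        if loglik > best_loglik:
            best_theta, best_loglik = theta, loglik
    return best_theta


# Hypothetical (a, b, c) values for three administered items.
items = [(1.0, -0.65, 0.15), (1.0, 0.0, 0.15), (1.0, 0.65, 0.15)]
theta_two_right = mle_ability([1, 1, 0], items)
theta_one_right = mle_ability([1, 0, 0], items)
```

With nonzero guessing, the likelihood for a mixed pattern can still peak at the lower edge of the grid, which is one reason a bounded search is used in this sketch.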

Item usage for all conditions is given in Table A-2 in Appendix A. The body of the 
table gives the number of examinees, out of a total of 60,000 for each population group, who 
were administered each item in the pool. Note that four of the 75 CAT items were never 
administered. This occurred because, at every ability level, there were at least 25 items that 
were more informative than these items. This phenomenon occurs in real CATs as well. To 
show how the usage of items varies across ability level, two illustrative tables were produced. 
Table A-3 shows item usage for various ability intervals in the reference group. Table A-4 
gives the corresponding information for the focal N(-1, 1) group for the Pool 3 items. 

2.3 Specification of item parameters 

Within each of the 18 "administrations," the factors that were varied were the item 
discrimination (a) and reference group difficulty (b) parameters^ and the item d parameters, 
representing the degree to which the item difficulties differed. Decisions needed to be made 
about the distributions of item parameters (assuming a 3PL item response function) and of the 
DIF parameters d. We chose to use multivariate normal distributions to model the joint 
distribution of the DIF and item parameters, with a natural log transformation applied to the a 
parameter. We used three different multivariate normal distributions, each corresponding to 
an item pool, to generate the items. The parameters of these distributions are given in Tables 
2 and 3. Sections 2.3.1 - 2.3.4 describe how we determined the means, standard deviations, 
and intercorrelations shown in the tables. The parameters for the pretest items were selected 
in a much simpler fashion, described in section 2.3.5. Procedures for obtaining item 
parameter estimates for use in analysis are described in section 2.3.6. 

^Although we use the notation $b_{jR}$ to represent the reference group difficulty in some 
instances, we suppress the subscript for simplicity of notation in others. A b without a 
subscript refers to the difficulty for the R group. 

Insert Tables 2 and 3 about here. 



2.3.1 Marginal mean and standard deviation of distribution of d 

In this study, the DIF parameter for item j was defined as $d_j = b_{jR} - b_{jF}$. Therefore, a 
value of d greater than zero implied that an item was easier for the focal group than for the 
reference group, whereas d less than zero implied that the item was harder for the focal 
group. To decide on the distribution of d in Pools 2 and 3, we used both theoretical and 
empirical findings on the relation of MH D-DIF to d. 

Donoghue, Holland and Thayer (1993) used the work of Holland and Thayer (1988) to 
show that, under certain Rasch model conditions, the MH D-DIF statistic provides an estimate 
of 4ad. The assumptions under which this finding holds are that (1) within each of the 
groups (reference and focal), the item response functions follow the Rasch model (obtained 
from equation 1 by setting $c_j = 0$ and $a_j = a$ for all items j), (2) the matching 
variable is the number-right score based on all items, including the studied item, and (3) the 
items have the same item response functions for the reference and focal groups (i.e., $b_{jR} = b_{jF} 
= b_j$), with the possible exception of the studied item. 
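
The multiplier of 4 can be motivated by a short calculation. The sketch below assumes the usual definition MH D-DIF = -2.35 ln(alpha_MH) from Holland and Thayer (1988), which is not restated in this section, and it idealizes the MH common odds ratio as the reference-to-focal odds ratio at a fixed ability. Here $d_j = b_{jR} - b_{jF}$ is the DIF parameter defined at the start of this section, and $P_{jG}$, $Q_{jG} = 1 - P_{jG}$ are the response probabilities in group G with $c_j = 0$ and $a_j = a$.

\[
\alpha_j = \frac{P_{jR}/Q_{jR}}{P_{jF}/Q_{jF}}
         = \frac{\exp\{1.7\,a(\theta - b_{jR})\}}{\exp\{1.7\,a(\theta - b_{jF})\}}
         = \exp\{1.7\,a(b_{jF} - b_{jR})\}
         = \exp(-1.7\,a\,d_j),
\]
\[
-2.35 \ln \alpha_j = 2.35 \times 1.7 \times a\,d_j \approx 4\,a\,d_j .
\]

The odds ratio does not depend on $\theta$ in this special case, which is why a single constant multiplier arises.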
Previous simulation work has shown that, when guessing is present, the appropriate 
multiplier is less than 4. In addition to nonzero guessing parameters, our simulation study 
included two values of a, rather than a single one. To help us select an appropriate marginal 
mean and standard deviation of d for Pools 2 and 3, we examined the regression of MH D-DIF 
on ad for several sets of simulated data. We found the multiplicative constants to be 
between 2 and 3 and the additive constants to be about zero. Using this result, we were able 
to determine a mean and standard deviation for d (0 and .3) that would produce realistic 
distributions of MH D-DIF. In Pool 1, the DIFless pool, the mean and standard deviation of 
d were, of course, zero. 

2.3.2 Marginal means and standard deviations of distributions of item parameters 

Properties of actual data sets were used to determine how to model the marginal 
means and standard deviations of the item parameters. Verbal and Mathematical sections of 
two forms of the SAT test were obtained from College Board Statistical Analysis for this 
purpose. One form, 3KSA07, had not been screened based on DIF pretest information; the 
other form, 3LSA02, had been. We looked at the statistics for all items and for only those 
items that were included in the pool. From these, the means and standard deviations of ln a, 
c, and the MH D-DIF statistic (for male-female, White-Black, and White-Asian analyses) 
were obtained. Also, as supplementary information, the means and standard deviations of 
item parameters from the initial CAT pool for the GRE quantitative section were obtained. 
To simplify the simulation, we set $c_j$ equal to .15 for all items. This value was close to the 
average value in the SAT and GRE data sets. The means and standard deviations of ln a 
(-.15, .30) and b (0, .15) were also chosen to be similar to the average values for these data. 

2.3.3 Intercorrelations among item and DIF parameters. 

The SAT data sets described in the previous section were also used in modeling the 
intercorrelations among the parameters. For purposes of determining the correlation of DIF 
with the other parameters in Pools 2 and 3, we used the MH D-DIF statistic as a proxy for d. 
To aid in determining reasonable intercorrelations of a, b, and d, the partial correlations of the 
estimates of ln a, b, and MH D-DIF, with the estimated c partialed out, were examined, in 
addition to the zero-order correlations. 

The intercorrelations of the item parameters were determined as follows. Pool 1 has 
no DIF, so d is uncorrelated with ln a and with b. By design, d is also uncorrelated with b in 
Pool 2, which closely resembles the actual results for male-female DIF in our SAT data sets, 
in the data analyzed by Kulick and Hu (1989) and in other unpublished analyses of College 
Board data. The correlation of b and d for Pool 3 was set equal to .40, which is 
approximately equal to the average of the correlations of b and MH D-DIF for the White-Black 
and White-Asian DIF analyses in the SAT data sets. The average correlation of ln a 
and MH D-DIF in the SAT data sets was .04. There was considerable variation, but it did 
not seem to follow a meaningful pattern. Therefore, a value of zero, which approximated the 
mean correlation, was assigned for all three pools. The average correlation of ln a and b was 
about .40; this value was assigned in all three pools. Table 3 shows the correlation values used 
for modeling the joint distribution of ln a, b, and d in each of the three pools. 

2.3.4 Discretized multivariate normal approach 

To generate the item parameters, we assumed a multivariate normal distribution of 
ln a, b, and d, and then discretized it so that only selected values of each parameter could 
occur. By discretizing the distribution, we could assure that only a finite number of item 
types were possible, to facilitate summarization and interpretation of results. The values of 
the parameters that were selected for inclusion in the study were 

ln a: -.3, 0 (corresponding to a values of .74, 1) 

b: -1.95, -1.3, -.65, 0, .65, 1.3, 1.95 

d: -.70, -.35, 0, .35, .70 in Pools 2 and 3; d = 0 for all items in Pool 1. 

This implies a total of 14 possible combinations of a and b, each of which could have five 
possible levels of DIF in Pools 2 and 3. 

The probabilities from a multivariate normal distribution with the specified parameters 
were used to assign probabilities to the cells of a 2 x 7 x 5 contingency table. To understand 
how this was done, consider the b parameter. As noted above, there were to be seven values 
of b, separated by .65. The probability associated with b = x was defined as P(x - .65/2 < b < 
x + .65/2), with the following modification: Probabilities associated with values outside the 
intervals surrounding the desired seven values of b were set to zero, and the remaining 
probabilities were renormalized so that they would sum to 1. The resulting probabilities were 
then multiplied by the desired number of items for the pool and then rounded to integer 
values. 
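
For a single parameter, the discretization can be sketched as below. This is a univariate illustration only: the study worked with the joint distribution of ln a, b, and d, and the standard deviation of 1.0 used here is a placeholder chosen for the example, not a value taken from Table 2. The function names are likewise hypothetical.

```python
import math


def normal_cdf(x, mean=0.0, sd=1.0):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def discretized_counts(values, spacing, mean, sd, n_items):
    """Interval probabilities around each value, renormalized, scaled to item counts."""
    half = spacing / 2.0
    probs = [normal_cdf(v + half, mean, sd) - normal_cdf(v - half, mean, sd)
             for v in values]
    total = sum(probs)  # mass outside the seven intervals is dropped
    probs = [p / total for p in probs]
    counts = [round(p * n_items) for p in probs]
    return probs, counts


b_values = [-1.95, -1.3, -0.65, 0.0, 0.65, 1.3, 1.95]
# sd=1.0 is an illustrative assumption, not the study's value.
probs, counts = discretized_counts(b_values, 0.65, mean=0.0, sd=1.0, n_items=75)
```

Because each cell is rounded separately, the rounded counts need not sum exactly to the target pool size, so some adjustment of cell frequencies may be needed afterward.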

We generated the parameters for Pool 3 first. Pool 1 was easily obtained from Pool 3 
by setting all d parameters to 0. Pool 2 was obtained as described above but with the added 
restriction that the marginals for a and b remain the same as for the other two pools. The 
generated frequencies of d needed to be adjusted to meet this restriction. The resulting joint 
distribution was then checked to verify that the correlations were close to their intended 
values.^ The joint frequency distributions of CAT item parameters are given in Tables A-5, 
A-6, and A-7 for Pools 1, 2, and 3, respectively. The a, b, and d parameters for all three 
pools of items are given in Table A-8. 

2.3.5 Nonadaptive pretest item parameters 

In large testing programs, test forms often include not only items that will be used in 
computing the examinee's overall score, but "pretest" items that are being evaluated for 
possible future use. Because the items have never been administered, item parameter 
estimates are not available. In CAT-administered exams, some testing programs are choosing 
to accompany adaptively administered items with a set of pretest items that are not adaptively 
administered. Therefore, we wanted to consider DIF analysis procedures for such items. 
Responses were generated to the same set of 15 pretest items for each examinee. For these 
items, all values of ln a were equal to 0 (corresponding to a = 1). The five levels of d were 
crossed with three levels of b: -1.3, 0, and 1.3. Parameters for the pretest items are given in 
Table A-9. The pretest items were identical for all examinees, regardless of the pool from 
which the CAT items were selected. The DIF analysis method applied to pretest items is 
discussed in section 3.2. 

^Note that, in this study, an item number (1-75) defines a combination of a, b, and c 
parameters and a, b, and c parameter estimates. These values are associated with that item 
number, regardless of item pool. However, the DIF properties of the items vary across pools. 
The items in Pool 1 have no DIF and the amount of DIF associated with a particular item 
number is not, in general, the same for Pools 2 and 3. 

2.3.6 Item parameter estimation for the CAT 

The CAT item parameter estimates used for computing item information and ability 
estimates were obtained through an analogue to a paper-and-pencil test administration. (The 
administration and analysis of the pretest items did not require that item parameter estimates 
be obtained for these items.) A sample of 2,000 examinees was "administered" all 75 items, 
and the LOGIST program (Wingersky, 1983; Wingersky, Patrick, & Lord, 1988) was used to 
estimate the a, b, and c parameters for each item. Because 2,000 is a typical sample size for 
such calibrations, this approach allowed us to incorporate a realistic amount of estimation 
error. The estimated a, b, and c parameters, which were the same for all three pools, are 
given in Table A-10, along with the true parameters. 

We included only members of the reference population in our calibration sample. 
Initially, we considered using a sample consisting of both reference and focal group members 
for item calibration or using a weighted combination of true reference and true focal group 
parameters, possibly with an error term added. However, because we wished to compare our 
results across simulation conditions, it was desirable to use the same set of parameter 
estimates in all cases. In fact, in actual CATs, a single set of parameter estimates is used, 
regardless of the demographic composition of the test-takers in a particular administration. It 
was not possible to define a calibration sample that included members of all three focal 
groups in a manner that was realistic or useful; therefore, including only reference group 
members appeared to be the best procedure. In our simulation, estimation of both item 
information functions and examinee ability is based on an incorrect (DIFless) model for the 
focal group. This closely approximates the situation that arises in actual testing applications 
when the true item response functions are different for the two groups, but the focal group 
constitutes only a small proportion of the calibration sample. In this case, item parameter 
estimates are, for all practical purposes, estimates of the reference group parameters. 

3. DIF analyses 

Originally, our investigation was to focus on three general DIF approaches: (1) the 
MH and standardization DIF methods, using expected true score on the CAT as a matching 
variable, (2) a variation on (1) for nonadaptive pretest items, in which the matching variable 
is the sum of the expected true score on the CAT and the score on the studied item, and (3) 
comparison of item percents correct for late-occurring items. In addition, we planned a 
comparison between DIF results obtained from CATs to results obtained by administering all 
pool items and matching either on expected true score based on all item responses or on 
number-right score. 
Preliminary simulations allowed us to eliminate from further consideration the method 
based on comparison of item percents correct for late-occurring items. The reasons for 
eliminating this method are described in the next section, followed by a description of the 
MH and standardization methods. 

3.1 Comparison of item percents correct for late-occurring items 

In this proposed DIF analysis method, an examinee's data for an item were to be 
included in the analysis only if the examinee received the item in the latter part of the CAT. 
Then, the simple differences in item percents correct for the reference and focal groups were 
to be examined. This approach was based on the expectation that examinees who received an 
item toward the end of their CATs would be quite well matched in ability, so that DIF 
statistics and simple differences in percents correct would yield similar conclusions (Holland 
& Zwick, 1991). However, results from simulation data indicated that this matching strategy 
did not work as expected. Two types of simulation data were generated. In one simulation, 
item parameters for a 75-item CAT pool were constructed according to the procedures 
described in section 2.3. Five thousand examinees were selected from a standard normal 
ability distribution and were administered a 25-item CAT. The mean and variability of true 
ability for examinees who took items late in the test were then examined. Specifically, we 
compared the mean and standard deviation of true ability for (1) all examinees taking the 
item, (2) examinees taking the item in positions 16-25, and (3) examinees taking the item in 
position 25. In only 32 of 75 items was the variance of ability smaller for examinees who 
took the item in positions 16-25 than for all examinees taking the item. Similarly, restricting 
attention to only those who took the item last did not assure a decrease in variability and, of 
course, led to dramatic sample size reductions. 

To determine whether this undesirable result was an artifact of our 
particular simulation, the same type of analysis was conducted using preliminary item 
parameter estimates for 90 items from the actual CAT pool for the GRE quantitative section. 
Again, results were obtained for a sample of 5,000 from a N(0,1) population. In this 
simulation, it was found that restricting attention to late usage (positions 16-20 for a 20-item 
CAT) led to variance reductions in only 40 of 90 items. Examination of the information 
tables for both these simulations showed that items often appeared toward the bottom portion 
of the table for several widely separated ability levels. The situation is likely to be 
exacerbated in the case of actual CATs, in which constraints on item type and content (e.g., 
not too many items on a particular topic) will mean that item information plays a less 
important role in selecting items. Based on our early simulation findings, we excluded this 
method from the remainder of our study. 

3.2 The Mantel-Haenszel and standardization DIF procedures 

In both the MH and standardization methods of DIF analysis, examinees are first 
grouped on the basis of a matching variable that is intended to be a measure of ability in the 
area of interest. In most DIF applications, the matching variable is a total test score, based 
either on the test in which the studied item is embedded or, if the studied item is being 
pretested, on a separate test in the same subject area. The score on the studied item, group 
membership, and the value of the matching variable for each examinee define a 2 x 2 x K 
cross-classification of examinee data, where K is the number of levels of the matching 
variable. This 3-way classification forms the basis of both the MH and standardization 
procedures. One 2 x 2 layer of this 2 x 2 x K array is represented below. 

                   Performance on the Studied Item 

Group          Correct = 1     Incorrect = 0     Total 

Reference      $A_k$           $B_k$             $n_{Rk}$ 

Focal          $C_k$           $D_k$             $n_{Fk}$ 

Total          $m_{1k}$        $m_{0k}$          $T_k$ 

^The description of the MH D-DIF and STD P-DIF statistics is adapted from Donoghue, 
Holland, & Thayer (1993). 

In this notation, there are $T_k$ examinees with the same value of the matching variable. Of 
these, $n_{Rk}$ are in the reference group and $n_{Fk}$ are in the focal group. Of the $n_{Rk}$ reference group 
members, $A_k$ answered the studied item correctly while $B_k$ did not. Similarly, $C_k$ of the $n_{Fk}$ 
matched focal group members answered the studied item correctly, whereas $D_k$ did not. The 
MH measure of DIF is defined as 

\[ \text{MH D-DIF} = -2.35 \ln(\hat{\alpha}_{MH}) \tag{2} \]

where $\hat{\alpha}_{MH}$ is the Mantel-Haenszel conditional odds-ratio estimator given by 

\[ \hat{\alpha}_{MH} = \frac{\sum_k A_k D_k / T_k}{\sum_k B_k C_k / T_k} . \tag{3} \]

The transformation of $\hat{\alpha}_{MH}$ in (2) places MH D-DIF on the ETS delta scale of item difficulty 
(Holland & Thayer, 1985). The effect of the minus sign in (2) is to make MH D-DIF 
negative when the item is more difficult for members of the focal group than it is for 
comparable members of the reference group. An estimated standard error for MH D-DIF is 
given in Holland and Thayer (1988), based on work reported in Robins, Breslow and 
Greenland (1986) and Phillips and Holland (1987). It is 



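\[ SE(\text{MH D-DIF}) = 2.35 \sqrt{Var(\ln(\hat{\alpha}_{MH}))} \tag{4} \]

where $Var(\ln(\hat{\alpha}_{MH}))$ is estimated by 

\[ Var(\ln(\hat{\alpha}_{MH})) = \frac{\sum_k U_k V_k / T_k^2}{2 \left( \sum_k A_k D_k / T_k \right)^2} \tag{5} \]

where 

\[ U_k = (A_k D_k) + \hat{\alpha}_{MH} (B_k C_k), \qquad V_k = (A_k + D_k) + \hat{\alpha}_{MH} (B_k + C_k). \tag{6} \]

The Mantel-Haenszel chi-square test of the null hypothesis of no difference between 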
the performance of the focal group and of comparable members of the reference group on the 
studied item was not examined in our study. 
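
As a minimal sketch of how equations 2-6 fit together (not the ETS DIF software), the MH D-DIF statistic and its standard error can be computed from the K layers of the 2 x 2 x K table. The function name `mh_d_dif` and the tuple layout `(A, B, C, D)` are illustrative assumptions, and the variance term follows the Phillips and Holland (1987) form of the estimator cited above.

```python
import math

def mh_d_dif(layers):
    """Return (MH D-DIF, SE) for a list of (A, B, C, D) counts, one per level k.

    A, B = reference group correct / incorrect;
    C, D = focal group correct / incorrect (hypothetical layout).
    """
    num = 0.0  # sum over k of A_k * D_k / T_k
    den = 0.0  # sum over k of B_k * C_k / T_k
    for a, b, c, d in layers:
        t = a + b + c + d
        if t == 0:
            continue  # an empty level contributes nothing
        num += a * d / t
        den += b * c / t
    alpha = num / den  # equation 3

    # Equations 5 and 6: estimated variance of ln(alpha)
    uv_sum = 0.0
    for a, b, c, d in layers:
        t = a + b + c + d
        if t == 0:
            continue
        u = a * d + alpha * b * c
        v = (a + d) + alpha * (b + c)
        uv_sum += u * v / t ** 2
    var_ln_alpha = uv_sum / (2 * num ** 2)

    mh = -2.35 * math.log(alpha)           # equation 2
    se = 2.35 * math.sqrt(var_ln_alpha)    # equation 4
    return mh, se

# Hypothetical two-level table in which the focal group does worse at each level
mh, se = mh_d_dif([(30, 10, 20, 20), (40, 5, 30, 10)])
```

With a single layer, the variance estimate reduces to the familiar 1/A + 1/B + 1/C + 1/D for a log odds ratio, which is a convenient check on an implementation.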

The standardization DIF measure, developed by Dorans and Kulick (1986), is 

\[ \text{STD P-DIF} = P_F - P_R^{*} , \tag{7} \]

where $P_F$ is the proportion in the focal group who get the studied item correct, and $P_R^{*}$ is an 
adjusted proportion correct on the item for the reference group, defined as 

\[ P_R^{*} = \sum_k \frac{n_{Fk}}{n_F} \left( \frac{A_k}{n_{Rk}} \right) , \tag{8} \]

where $n_F = \sum_k n_{Fk}$ is the total number of examinees in the focal group. One interpretation of 
$P_R^{*}$ is that it is the proportion of reference group examinees who would have got the studied 
item right had the distribution of the matching variable in the reference group been the same 
as it is for the focal group.^ 

The estimated standard error for STD P-DIF is given by the formula 

\[ SE(\text{STD P-DIF}) = \sqrt{\sigma_F^2 + \sigma_R^2} \tag{9} \]

where 

\[ \sigma_F^2 = \frac{1}{n_F} P_F (1 - P_F) \tag{10} \]

and 

\[ \sigma_R^2 = \sum_k \frac{n_{Fk}^2 \, A_k B_k}{n_F^2 \, n_{Rk}^3} . \tag{11} \]

^When $n_{Rk}$ is equal to zero, both $P_R^{*}$ and $\sigma_R^2$ are undefined. When this occurs, the 
standard ETS DIF software implements an imputation procedure proposed by Holland 
(McHale, Dorans, Holland, & Petersen, 1988). Analogous procedures, modified to take into 
account the special nature of the CAT-based analyses, were used in our work. 
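
The standardization statistic and its standard error can be sketched in the same way. This is an illustration only: the function name `std_p_dif` and the `(A, B, C, D)` layout are assumptions, the reference-group variance term is taken to be the sum over levels of (n_Fk / n_F)^2 times the binomial variance of A_k / n_Rk, and the imputation procedure used when a level has no reference group members is not reproduced (the sketch simply raises an error).

```python
import math

def std_p_dif(layers):
    """Return (STD P-DIF, SE) for a list of (A, B, C, D) counts, one per level k.

    A, B = reference group correct / incorrect;
    C, D = focal group correct / incorrect (hypothetical layout).
    """
    n_f = sum(c + d for _, _, c, d in layers)      # focal group size
    p_f = sum(c for _, _, c, _ in layers) / n_f    # focal proportion correct

    p_r_star = 0.0  # reference proportion correct, reweighted to the focal distribution
    var_r = 0.0     # variance of the reweighted reference proportion
    for a, b, c, d in layers:
        n_rk, n_fk = a + b, c + d
        if n_fk == 0:
            continue  # zero focal weight at this level
        if n_rk == 0:
            raise ValueError("no reference examinees at a level with focal examinees")
        w = n_fk / n_f
        p_r_star += w * a / n_rk
        var_r += w ** 2 * a * b / n_rk ** 3

    var_f = p_f * (1 - p_f) / n_f
    return p_f - p_r_star, math.sqrt(var_f + var_r)

# Hypothetical two-level table
dif, se = std_p_dif([(180, 90, 20, 20), (450, 180, 40, 20)])
```

Because the focal group supplies the weights, levels with many focal examinees dominate both the adjusted proportion and the variance term.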



Our matching variable for the DIF analysis of the CAT-administered items was 
obtained by (1) getting the examinee's MLE of ability, based on the responses to the 25 CAT 
items and (2) using this MLE, along with the estimated item parameters, to compute an 
expected true score on all pool items by summing the 75 values of the estimated item 
response functions. That is, the matching variable was 



\[ \text{Expected true score based on CAT} = \sum_{j=1}^{75} \hat{p}_j(\hat{\theta}_{25}) , \tag{12} \]

where $\hat{p}_j(\cdot)$ is an estimate of the function defined in equation 1 and $\hat{\theta}_{25}$ is the MLE of 
ability based on the CAT items. Examinees whose expected true scores fell in the same 
one-unit intervals were considered to be matched. For the pretest items, which were administered 
nonadaptively, the matching variable was the sum of the expected true score on the CAT, 
computed according to equation 12, and the score (0 or 1) on the studied pretest item. 
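
The matching variable in equation 12 can be sketched as follows. Several details are assumptions made for illustration rather than taken from the report: the three-parameter logistic function is written with the scaling constant 1.7, one-unit intervals are formed by taking the floor of the score, and all function names are hypothetical.

```python
import math

D = 1.7  # assumed logistic scaling constant

def p3pl(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

def expected_true_score(theta_hat, items):
    """Sum the estimated item response functions over all pool items.

    items: list of (a, b, c) parameter estimates, one tuple per pool item.
    """
    return sum(p3pl(theta_hat, a, b, c) for a, b, c in items)

def matching_level(score, pretest_response=None):
    """Assign a score to a one-unit interval (floor is an assumed convention).

    For a pretest item, the 0/1 response to the studied item is added first.
    """
    if pretest_response is not None:
        score += pretest_response
    return math.floor(score)

# Hypothetical 75-item pool with identical parameters, ability estimate of 0.3
pool = [(1.0, 0.0, 0.2)] * 75
score = expected_true_score(0.3, pool)
level = matching_level(score)
```

Because the sum runs over all 75 pool items, two examinees who took different CAT items are still placed on a common score scale.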
4. Organization of the data for DIF analysis 

4.1 Examinee records 

The record that was constructed for each examinee contained the following 
information: population indicator (either reference or one of 3 types of focal), true ability, 
pool indicator (1, 2, or 3), string of 75 responses to all pool items, item numbers of CAT 
items administered (in order), string of 15 pretest item responses, estimated ability for the 75- 
item nonadaptive test and for the 25-item CAT, and expected true scores corresponding to 
each of the two ability estimates. 

Generating responses to all 75 pool items had two purposes: (1) These responses 
could be used for the "nonadaptive control" part of the study, which attempted to distinguish 
the effects of using CAT data from the effects of using expected true score as a matching 
variable and (2) the responses could be used to construct additional CATs for the examinees 
if desired by using the CAT algorithm to generate a CAT sequence and plugging in the 
existing item responses. Although we did not make use of (2), constructing the record in this 
way makes it possible for us to generate data less expensively in future research. 

4.2 Definition of sample size conditions 

In our CAT setting, it was not clear how best to define sample size for purposes of 
data simulation and analysis. If groups of a fixed sample size were drawn and the CAT 
administered, the sample sizes per item would have a huge range. For example, in Table A-2, 
the range of item sample sizes is from 0 to 51,133 (out of a total of 60,000) for the focal 
group in Conditions 3 and 4. Because our goal was to investigate the behavior of selected 
DIF statistics under a fixed sample size, simply analyzing the available data for each item 
was clearly undesirable. We therefore considered several other approaches. Initially, we 
attempted to generate enough data to meet the target item sample sizes for all conditions. 
This implied that 900 reference group members were needed, along with 500 members of 
each of the three focal groups for each of the three pools (see Table 1). To achieve this goal 
for most items required generating 60,000 cases for the reference group and for each of the 
nine focal distribution by item pool combinations. To assess variability, we planned to 
conduct two replications per condition. 

After examining the DIF results from this approach, we concluded that the standard 
errors of the DIF statistics were large enough to make it difficult to characterize the behavior 
of the statistics for different item types and different conditions. Even averaging across two 
replications did not appear adequate. Because of the cost of data generation, we did not wish 
to simulate additional data. We considered several resampling approaches, which would have 
allowed us to obtain multiple estimates of each statistic, but none seemed ideal for our 
purpose. The approach we ultimately decided to use, proposed by Charles Lewis, was as 
follows: For each item, we used all the available CAT data (out of a maximum of 60,000 
responses per group) to form the 2 (item responses) x 2 (groups) x K (levels of the matching 
variable) contingency table needed for DIF analysis (see section 3.2). We then converted the 
table frequencies to proportions of the total number of observations. Using these proportions 
as estimates of the population probabilities associated with the 4 x K cells for the particular 
configuration of conditions in question, we obtained expected tables for our target sample 
sizes by multiplying the probability estimates for focal group cells by the desired focal group 
sample size and then doing the same for the reference group cells. Next, we computed DIF 
statistics and standard errors, based on the expected tables, for all 18 conditions in Table 1. 
(Note that the estimate of the STD P-DIF statistic obtained using the expected table approach 
is the same as the value obtained using all available data, regardless of the target sample 
sizes.) 

As a simple example of the expected table (ET) approach, consider the following 
hypothetical data for a single item, assuming that there are only two levels of the matching 
variable. The first step would be to use all the data available for the item to construct a 2 x 2 
x 2 frequency table (because K = 2 here). Then the frequencies for the reference group 
would be divided by the total number of reference group examinees and the frequencies for 
the focal group would be divided by the total number of focal group examinees, producing 
the following 2 x 2 x 2 table of probabilities: 

Low on Matching Variable 



Right Wrong Total 

Reference .2 .1 .3 

Focal .2 .2 .4 

High on Matching Variable 

Right Wrong Total 

Reference .5 .2 .7 

Focal .4 .2 .6 

Now assume that we wanted target tables for the $n_R$ = 900, $n_F$ = 100 condition. The reference 
group probabilities would be multiplied by 900 and the focal group probabilities would be 
multiplied by 100, producing the following table, which would then be used for DIF analysis. 

Low on Matching Variable 
Right Wrong Total 

Reference 180 90 270 

Focal 20 20 40 

High on Matching Variable 
Right Wrong Total 

Reference 450 180 630 

Focal 40 20 60 
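
The scaling step in this example can be sketched in a few lines. The function names and the tuple layout `(reference right, reference wrong, focal right, focal wrong)` are illustrative assumptions; the probabilities and target sample sizes are the ones from the hypothetical example above.

```python
import math

# Within-group probabilities for each level of the matching variable:
# (reference right, reference wrong, focal right, focal wrong)
probs = [
    (0.2, 0.1, 0.2, 0.2),  # low on matching variable
    (0.5, 0.2, 0.4, 0.2),  # high on matching variable
]

def expected_table(probs, n_ref, n_focal):
    """Scale within-group probabilities to the target group sample sizes."""
    return [
        (pa * n_ref, pb * n_ref, pc * n_focal, pd * n_focal)
        for pa, pb, pc, pd in probs
    ]

def mh_d_dif(layers):
    """MH D-DIF (point estimate only) from (A, B, C, D) layers."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in layers)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in layers)
    return -2.35 * math.log(num / den)

table = expected_table(probs, n_ref=900, n_focal=100)
statistic = mh_d_dif(table)
```

For this hypothetical table the MH D-DIF value comes out at roughly -0.98. Changing `n_ref` and `n_focal` re-uses the same probability table, which is what makes it inexpensive to examine additional sample size conditions.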

For the MH D-DIF statistic, the formula for the standard error of the ET estimate, 
$SE_{ET}$(MH D-DIF), is given in Appendix B. For the sample size conditions we investigated, its 
value is very similar to the value of SE(MH D-DIF) (equations 4-6) obtained using all the 
available data. For the STD P-DIF, the standard error of the ET estimate, $SE_{ET}$(STD P-DIF), 
is identical to the value of SE(STD P-DIF) (equations 9-11) obtained using all the data; 
therefore, no special computing formula is required. $SE_{ET}$(MH D-DIF) and $SE_{ET}$(STD P-DIF) 
are typically much smaller than the ordinary standard errors that would be obtained for the 
target sample sizes in question.^6 Therefore, even though it produces only a single estimate, 
the ET approach can provide a relatively precise idea of the behavior of the DIF statistics. A 
comparison of the ET method to an estimation procedure based on multiple replications 
appears in Appendix C. Our comparison was based on pretest items, for which 60,000 
responses per population group were available for each item. As shown in Table C-1, the ET 
method was found to give results similar to those of the replication-based approach. For the 
items we studied, the ET estimate of MH D-DIF was as precise as the average over 316 
replications of the MH D-DIF statistic based on the target sample sizes. Another advantage 
of the ET approach is that, once the 2 x 2 x K probability tables have been created, DIF 
results can be generated easily for any target sample size. This will be useful if we wish to 
consider other sample size conditions in the future. 

^6 $SE_{ET}$(MH D-DIF) and $SE_{ET}$(STD P-DIF) reflect the degree of precision with which the 
population value is estimated using the ET approach. Because the ET estimates are typically 
based on thousands of cases in this study, these standard errors tend to be small. They should 
not be confused with the standard errors that are computed based on the expected tables 
generated with the ET approach, using the usual formulas (equations 4-6 and equations 9-11). 
This second type of standard error (which does not have an "ET" subscript) closely 
approximates the standard errors that would be obtained using actual samples of the target 
sizes. 

5. Results 

The results of the study are summarized in the following sections. Section 5.1 gives 
results for the items in the 75-item CAT pool, section 5.2 gives results for the pretest items, 
and section 5.3 gives some results on ability estimation for examinee groups. 

5.1 Results for the CAT pool items 

Results are given first for the comparison of CAT-based DIF results to nonadaptive 
DIF analyses. Correlations between CAT-based DIF statistics, DIF statistics based on 
nonadaptive administration, and "true DIF" are presented first, along with the means of the 
various DIF measures. For purposes of this analysis, true DIF was defined as the product of 
the item discrimination parameter (a) and the difference between the item difficulties for the 
reference and focal groups (d). The theoretical rationale for defining true DIF in this way is 
given in section 2.3.1. Following this, tables of MH D-DIF and STD P-DIF means for every 
combination of ad and b are given, along with a discussion of the standard errors of the DIF 
statistics. Finally, an estimate is given of the proportion of times each type of item would be 
declared an extreme DIF ("C") item using the ETS method of classifying items into DIF 
categories. 

5.1.1 Comparison of CAT-based and nonadaptive DIF analyses 

For selected simulation conditions, we compared MH and standardization results from 
the CAT analyses, described above, to results of two nonadaptive DIF analyses. The first was 
a procedure ($\hat{\theta}$-75) in which all 75 pool items were "administered" and examinees were 
matched on expected true score calculated using the MLE of ability based on all 75 responses 
(the "nonadaptive control"). That is, instead of the matching variable in equation 12, the 
matching variable was 

\[ \text{Expected true score based on all 75 items} = \sum_{j=1}^{75} \hat{p}_j(\hat{\theta}_{75}) , \tag{13} \]

where $\hat{\theta}_{75}$ is the MLE of ability based on all 75 items. The second approach (NR) was a 
conventional DIF analysis, in which all 75 pool items were administered and examinees were 
matched on number-right score. The results of this comparison are given in Tables 4-6. 

Insert Tables 4-6 about here. 

For this analysis, we chose to include only the simulation conditions that had DIF and 
were based on reference and focal sample sizes of 500; that is, Conditions 4, 6, 10, 12, 16, 
and 18 (see Table 1). For each of the six conditions, the correlation matrix was computed for 
four variables: the three types of DIF statistics and the true DIF for the item. Each 
correlation matrix was based on the 71 items that were administered in the CATs (see section 
2.3). 

The CAT-based MH D-DIF and STD P-DIF statistics used in this analysis were 
computed using the ET method, while the two other statistics were computed based on actual 
samples of 500 from the reference and focal groups. Therefore, for most items, the CAT 
statistics were much more precisely determined. To avoid giving a spuriously inflated 
impression of the performance of the CAT analyses, we computed correlations that were 
corrected for unreliability, using the following formula: 

\[ r^{C}_{XY} = \frac{r_{XY}}{\sqrt{r_{XX} \, r_{YY}}} \tag{14} \]

where $r^{C}_{XY}$ is the corrected correlation between X and Y, $r_{XY}$ is the ordinary Pearson correlation 
between X and Y, and $r_{XX}$ and $r_{YY}$ are the reliabilities of X and Y. For a particular type of DIF 
statistic (MH D-DIF or STD P-DIF), reliability was estimated as 

\[ \text{Reliability} = 1 - \frac{\sum_{j} SE_j^2(\text{DIF statistic}) / J}{\text{Variance across } J \text{ items of DIF statistic}} \tag{15} \]

where J is the number of items. The numerator on the right-hand side represents error 
variance, while the denominator represents total variance. (For the CAT DIF statistics, the 
$SE_j^2$ values were the squares of the $SE_{ET}$(MH D-DIF) or $SE_{ET}$(STD P-DIF) values as 
appropriate; see footnote 6. The reliability of ad is, of course, unity, since it is not a 
statistic.) These corrected correlations provide a more equitable way of comparing the CAT, 
$\hat{\theta}$-75, and NR analyses.^ 
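
A small sketch may make the correction in equations 14 and 15 concrete. The function names are hypothetical, the numbers in the usage lines are invented for illustration, and the across-item variance is computed with the n - 1 divisor, which is an assumption since the divisor is not stated.

```python
import math

def mean(x):
    return sum(x) / len(x)

def sample_variance(x):
    """Variance across items, using the n - 1 divisor (an assumed convention)."""
    m = mean(x)
    return sum((v - m) ** 2 for v in x) / (len(x) - 1)

def pearson(x, y):
    """Ordinary Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def reliability(dif_stats, std_errors):
    """Equation 15: one minus error variance over total variance."""
    error_var = sum(se ** 2 for se in std_errors) / len(std_errors)
    return 1 - error_var / sample_variance(dif_stats)

def corrected_correlation(x, y, rel_x, rel_y):
    """Equation 14: correlation corrected for unreliability in both variables."""
    return pearson(x, y) / math.sqrt(rel_x * rel_y)

# Invented DIF statistics for five items, their standard errors, and true DIF
dif = [-1.3, -0.6, 0.1, 0.5, 1.2]
se = [0.35, 0.30, 0.32, 0.36, 0.33]
true_dif = [-0.35, -0.15, 0.0, 0.15, 0.35]
r_corrected = corrected_correlation(dif, true_dif, reliability(dif, se), 1.0)
```

Since the reliability of a statistic is at most one, the corrected value is never smaller in magnitude than the uncorrected one, and it can exceed unity when the reliability estimate is too low.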

Both uncorrected and corrected intercorrelations of the values of the MH D-DIF 
statistic for the three types of matching variables and the values of the true DIF are given in 
Table 4 for each of the six conditions. The median across conditions is also given. The 
corresponding information for STD P-DIF is given in Table 5. Both Tables 4 and 5 show 
that the CAT, $\hat{\theta}$-75, and NR analyses produced results that were highly correlated with each 
other and with the true DIF values. In particular, the two analyses based on all 75 item 
responses produced virtually identical results (with corrected correlations exceeding unity). 
The median corrected correlation with true DIF was about the same for the CAT, $\hat{\theta}$-75, and 
NR analyses, which is somewhat surprising since the CAT DIF approach matches examinees 
on the basis of only 25 item responses. In general, correlations tended to be slightly higher 
for the MH D-DIF statistics than for STD P-DIF. The near-unity correlations of the CAT 
DIF statistics with true DIF were a welcome finding. 



^If reliability is underestimated, the corrected correlation in equation 14 can exceed unity. 

For some simulation conditions, we have direct evidence that use of the ET method 
gives similar correlation results on the performance of the CAT-based DIF procedure as does 
an analysis based on actual samples of the target sizes. For Condition 6, we compared MH 
results based on actual samples of 500 per group to the ET results. Based on the samples of 
500, the uncorrected correlations of the CAT MH D-DIF statistics with MH D-DIF values 
from the $\hat{\theta}$-75 and NR procedures were .88 and .87, respectively, the same as for the 
ET-based CAT statistics. Based on the samples of 500, the uncorrected correlation of the CAT 
MH D-DIF statistics with true DIF was .92, compared to .95 for the ET method. 

High correlations alone, however, do not ensure the accuracy of the DIF methods. To 
determine whether the obtained statistics had the desired means, we computed, for each 
analysis strategy in each simulation condition, the mean MH D-DIF and STD P-DIF values 
across the 71 items that were given in the CAT, along with the standard deviation across 
items. The results are given in Table 6, along with the medians over the six simulation 
conditions. (In the case of STD P-DIF, means and standard deviations have been multiplied 
by 10.) The mean across 71 items of the true DIF values is -.004 in Conditions 4, 10 and 16 
(Pool 2) and -.001 in Conditions 6, 12, and 18 (Pool 3), with a standard deviation of .293 in 
both pools. 

In MH D-DIF analysis in which all examinees take all items and the matching variable 
is number-right score, the average MH D-DIF is constrained to be approximately zero across 
items, producing a negative covariance among the DIF statistics within a test. If it were not 
for rounding error and for the adjustment procedure described in Footnote 2, the STD P-DIF 
statistics would sum to zero under these conditions as well. This constraint on the MH D-DIF 
and STD P-DIF is not present in the CAT and $\hat{\theta}$-75 analyses. In these other types of 
DIF analysis, the nature of the covariance across DIF statistics within a test is unknown. 

The issue of covariances across DIF statistics is relevant to Table 6 for two reasons: 
First, because of the constraint on the mean of the NR-based statistics, it is not clear which 
across-item NR mean is the most useful for comparison to other analyses: the one based on 
only the 71 items given in the CAT or the mean over 75 items. Both these means (and 
accompanying standard deviations) are therefore included in Table 6. Second, the non-zero 
covariances for the NR-based statistics and possibly for the other analyses make it difficult 
to estimate the standard errors of the means in Table 6. If the MH D-DIF statistics were 
independent across items, the standard errors of the average MH D-DIF statistics in Table 6 
would be roughly .009 for the CAT analysis and .049 for the two nonadaptive analyses 
(obtained by dividing the average item-level standard error by the square root of the number 
of items). Judged in this light, the means for the nonadaptive procedures were quite close to 
zero, but the means for the CAT procedure were slightly inflated. All six means for the 
CAT-based procedure were greater than zero and the means were larger for the Pool 3 
conditions than for the Pool 2 conditions. However, these values for the standard error of the 
mean are only approximate. Because of the negative covariances among NR MH D-DIF 
statistics within a test, the value of .049 is definitely an overestimate of the standard error of 
the mean for the NR analyses. Presumably, this overestimation holds for the $\hat{\theta}$-75 approach, 
which produced results nearly identical to the NR analyses. For the CAT DIF statistics, the 
value computed under independence may either under- or overestimate the standard error. In 
any case, the practical implications of an inflation of .01 to .05 in the MH D-DIF statistic are 
small in that a difference this size is unlikely to have much effect on decisions about the 
item. (Of course, it would be possible to rescale the statistics so that they would be centered 
on zero for a particular collection of items.) Under independence, the standard errors of the 
means for STD P-DIF x 10 in Table 6 would be about .008 for the CAT analysis and .03 for 
the two nonadaptive approaches (obtained, once again, by dividing the average standard error 
by the square root of the number of items). Again, there appears to be a slight inflation of 
the statistics in the CAT analysis. There were also relatively large departures from zero for 
the two nonadaptive methods in Condition 4. 

In addition to comparing the values of MH D-DIF and STD P-DIF for the three 
matching variables, we also examined their standard errors. For the $\hat{\theta}$-75 and NR analyses, 
the average values of SE(MH D-DIF) within each condition were about .40, whereas the 
CAT-based MH D-DIF statistics tended to have standard errors of about .35. One hypothesis 
for this discrepancy is that the smaller standard errors for the CAT DIF analysis are related, 
at least in part, to the use of the ET estimation method. Table C-1 shows that the ET-based 
estimates of SE(MH D-DIF) tended to be smaller than the average SE(MH D-DIF) across 66 
replications by about .03. Another hypothesis is that CAT-based methods of DIF analysis 
tend to produce lower standard errors for reasons unrelated to ET estimation, such as the 
restriction of the analysis to examinees in a smaller ability range (see the related discussion in 
section 5.2.2). The interpretation of the standard error results for the three DIF analyses is 
complicated, however, by the fact that the pattern of standard errors for MH D-DIF is not 
paralleled by the results for SE(STD P-DIF), where average standard errors (multiplied by 10) 
ranged from .25 to .32. Here, the average SE(STD P-DIF) within a condition did not vary 
much across the NR and CAT DIF statistics. Although the differences were small, 
however, the values of SE(STD P-DIF) were larger for the CAT approach than for the $\hat{\theta}$-75 
and NR approaches for all six simulation conditions. These standard error findings, along 
with those described in section 5.2.2, require further investigation. 

5.1.2 MH D-DIF and STD P-DIF statistics by item type 

In addition to comparing the CAT approach to nonadaptive DIF analysis methods, we 
examined the average CAT DIF statistics for various types of items and simulation factors. 
To determine the best way to summarize the results, we conducted a series of analyses of 
variance (ANOVAs) in which the observations were the DIF statistics and the independent 
variables were sample size condition, focal group distribution, item difficulty level (B), item 
discrimination (A), item DIF level (D), and item position. Pool 1 was analyzed separately; 
Pools 2 and 3 were analyzed both separately and in combination (with pool as an additional 
independent variable). We began with the MH D-DIF statistics, which we analyzed under 
several different assumptions concerning interactions among the independent variables and 
several different numbers of levels of item difficulty, item position, and DIF. Results were 
quite consistent across the analysis models. In Pool 1, only the B effect was significant at an 
$\alpha$ of .01;^ it explained less than 1% of the variance in the MH D-DIF statistics. In Pools 2 
and 3, D explained about 85% of the variance. Most analyses of Pools 2 and 3 showed very 
small, but statistically significant effects of B, and of the B x D and A x D interactions. 
Somewhat surprisingly, focal group distribution had, essentially, no effect, nor did it interact 
with other variables. Sample size too, had no effect. (Since the results for the two sample 
size conditions were generated from the same set of expected tables, they were highly 
correlated. The main value of generating results for two sample size conditions was that it 
allowed the examination of the behavior of the standard errors of the DIF statistics, discussed 
below.) Item position and pool never yielded statistically significant main effects, though 
these factors sometimes showed tiny interactions with other variables. We conducted similar 
analyses for the STD P-DIF statistics and obtained nearly identical results. Based on the 
ANOVA findings, we displayed MH D-DIF and STD P-DIF averages for every combination 
of ad and b. 

^As in all exploratory analyses, significance testing can be viewed here only as a rough 
tool for ranking the size of effects. 

5.1.2.1 MH D-DIF results 

The average MH D-DIF statistics are given in Tables 7-9 for Pools 1, 2, and 3, 
respectively. Results are given for the $n_R$ = 500, $n_F$ = 500 sample size condition only. As 
noted, results were nearly identical for the two sample size conditions. The average standard 
error of the estimate, $SE_{ET}$(MH D-DIF), is given as well. As described earlier, computation of 
the standard error of the mean DIF statistics is not straightforward. The average value of 
$SE_{ET}$(MH D-DIF), given in parentheses in Tables 7-9, is the maximum value that the standard 
error of the mean MH D-DIF could take (i.e., the value that would occur if all items had 
intercorrelations of one) and therefore yields an overestimate of the standard error of the 
mean. The third entry in each cell in Tables 7-9 is the number of item results contributing to 
the average. Since results are averaged over the three focal group distributions, a single item 
within a pool generates three entries in the table in which it occurs. The total number of 
entries in each of Tables 7-9 is 3 (focal group distributions) x 71 (items administered in the 
CAT) = 213. 

As shown in Table 7, the MH D-DIF statistics were well-behaved in the null case; 
they were equal to zero at the tabled level of accuracy. The bottom margins of Tables 8 and 
9 show that the average value of MH D-DIF was typically about 3.3 times the value of ad in 
Pool 2, and 3 times the value of ad in Pool 3 (compared to the theoretical value of 4ad that 
holds in the Rasch case described in section 2.3.1). In Pool 2, Table 8 shows that, for a fixed 
value of ad, the average MH D-DIF usually decreased in absolute value as b increased. For 
example, for ad = -.35, the average MH D-DIF was -1.3 for b = -1.95, -1.2 for b = 0, and 
-0.7 for b = 1.95. This phenomenon, noted by Donoghue, Holland, and Thayer (1993), occurs 
in simulations in which the guessing parameter c is constrained to be the same in the 
reference and focal groups. The more difficult the item, the closer the probability of correct 
response is to the guessing value, and the harder the groups are to differentiate. 
Superimposed on this phenomenon, Pool 3 (Table 9) included a correlation between the 
difficulty and DIF parameters. Easier items in Pool 3 are more likely to have negative DIF 
than harder items. The relation between MH D-DIF and b for fixed ad was not as evident in 
Pool 3 as it was in Pool 2. Also, the average MH D-DIF values for the DIFless items were not as 
close to zero as they were in Pools 1 and 2. For d = 0, the average MH D-DIF decreased 
from 0.3 to -0.5 as b increased from -1.95 to 1.95. One item that showed surprising behavior 
in Pool 3 was item 75, which appears in the bottom row of the body of Table 9 at the far 
right. The average MH D-DIF departs considerably from 3ad = 2.10. At first, we 
hypothesized that this was because item 75 was one of the items that were administered 
randomly to some examinees at the beginning of the CAT (see section 2.2). However, we 
found that item 75 had an unusually small MH D-DIF value even when administered 
nonadaptively. The most likely explanation is that the small DIF value is related to the 
extreme difficulty of the item. 
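
The effect of a common guessing parameter on hard items can be illustrated with a short calculation at a single ability level. This is a population-level sketch, not a reproduction of the simulation: it assumes a three-parameter logistic function with scaling constant 1.7, takes the focal group difficulty to be the reference difficulty minus d, and uses hypothetical function names and parameter values.

```python
import math

D = 1.7  # assumed logistic scaling constant

def p3pl(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

def conditional_mh(theta, a, b_ref, d, c):
    """-2.35 times the log odds ratio (reference vs. focal) at a fixed ability.

    The focal difficulty is taken to be b_ref - d, so a negative d makes
    the item harder for the focal group (an assumed parameterization).
    """
    p_r = p3pl(theta, a, b_ref, c)
    p_f = p3pl(theta, a, b_ref - d, c)
    odds_ratio = (p_r / (1 - p_r)) / (p_f / (1 - p_f))
    return -2.35 * math.log(odds_ratio)

# Without guessing, the value is close to 4ad regardless of difficulty
no_guessing = conditional_mh(theta=0.0, a=1.0, b_ref=0.0, d=-0.35, c=0.0)

# With a common guessing parameter, the value shrinks toward zero as b increases
with_guessing = [
    conditional_mh(theta=0.0, a=1.0, b_ref=b, d=-0.35, c=0.2)
    for b in (-1.95, 0.0, 1.95)
]
```

With c = 0 the log odds ratio equals -1.7ad at every ability level, so the statistic is about 4ad because 2.35 x 1.7 is very nearly 4. With c > 0 both groups' odds approach the same floor on very hard items, which pulls the odds ratio toward one.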

Insert Tables 7-9 about here. 

The values of SE(MH D-DIF) varied little across pools, DIF levels, item difficulty, or 
item discrimination. The primary determinant of SE(MH D-DIF) was sample size. For the 
$n_R$ = 500, $n_F$ = 500 condition, SE(MH D-DIF) ranged from about 0.3 to 0.4; for the $n_R$ = 900, 
$n_F$ = 100 condition, the range was from about 0.5 to 0.7. 

5.1.2.2 STD P-DIF results 

STD P-DIF results are given in Tables 10-12 for Pools 1, 2, and 3, respectively. The 
STD P-DIF statistics, as well as the values of $SE_{ET}$(STD P-DIF), have been multiplied by 10. 
Results are given for the $n_R$ = 500, $n_F$ = 500 sample size condition only. As noted earlier, ET 
estimates of the STD P-DIF statistics do not depend on the target sample sizes; therefore, the 
results for the $n_R$ = 900, $n_F$ = 100 sample size condition were identical. The average value of 
$SE_{ET}$(STD P-DIF) (x 10) is given as the second entry in each cell of the tables. As 
noted, this value yields an overestimate of the standard error of the mean. The third entry in 
each cell in Tables 10-12 is the number of item results contributing to the average. The third 
entries are the same as those in the MH D-DIF results (Tables 7-9). 

Table 10 shows that, in Pool 1, the STD P-DIF statistics were close to zero, as 
desired. The bottom margins of Tables 11 and 12 show that the average values of 
STD P-DIF x 10 were roughly 2.7 times the value of ad in Pool 2, and 2.5 times the value of 
ad in Pool 3. Table 11 shows that, unlike MH D-DIF, STD P-DIF did not tend to decrease in 
absolute value as b increased for a fixed value of ad. An aspect of the results that did mirror 
the MH D-DIF results was that the average STD P-DIF values for the DIFless items in Pool 3 
(Table 12) were not as close to zero as they were in Pools 1 and 2. For d = 0, the average 
value of STD P-DIF x 10 was 0.21 at b = -1.95 and 0.22 at b = -1.30. It then decreased as b 
increased, reaching -0.46 for b = 1.95. Also, as in the MH D-DIF results, item 75 in Pool 3, 
which appears in the bottom row of the body of Table 12 at the far right, had a smaller DIF 
statistic than expected. 

Insert Tables 10-12 about here. 

The values of SE(STD P-DIF) varied little across pools, DIF levels, item difficulty, or 
item discrimination. As in the case of SE(MH D-DIF), the primary determinant of 
SE(STD P-DIF) was sample size. For the n_R = 500, n_F = 500 condition, SE(STD P-DIF) x 10 
was always about 0.3; for the n_R = 900, n_F = 100 condition, it was about 0.5. 

5.1.3 Estimated percent of "C" results for item types 

ETS has a system for categorizing the severity of DIF based on MH results. 
According to this classification scheme, a "C" categorization, which represents extreme DIF, 
requires that the absolute value of MH D-DIF be at least 1.5 and be significantly greater than 
1 (at α = .05). A "B" categorization, which indicates moderate DIF, requires that MH D-DIF 
be significantly different from zero (at α = .05) and that the absolute value of MH D-DIF be 
at least 1, but not large enough to satisfy the requirements for a C item. Items that do not 
meet the requirements for either the B or the C categories are labeled "A" items, which are 
considered to be free of DIF. 
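
The classification rule just described can be sketched as a small function. This is an illustration, not the ETS implementation: the function name is hypothetical, and the significance tests are assumed to be simple z tests based on SE(MH D-DIF), one-sided against 1 for the C rule and two-sided against 0 for the B rule. An operational system may instead base the B test on the MH chi-square statistic.

```python
def classify_dif(mh_d_dif, se, z_one_sided=1.645, z_two_sided=1.96):
    """Return 'A', 'B', or 'C' for one item.

    The critical values are assumptions: a one-sided z test of
    |MH D-DIF| > 1 for the C rule and a two-sided z test of MH D-DIF = 0
    for the B rule, both at alpha = .05.
    """
    magnitude = abs(mh_d_dif)
    # C: at least 1.5 in absolute value and significantly greater than 1.
    if magnitude >= 1.5 and (magnitude - 1.0) / se > z_one_sided:
        return "C"
    # B: at least 1 in absolute value and significantly different from 0.
    if magnitude >= 1.0 and magnitude / se > z_two_sided:
        return "B"
    # A: everything else.
    return "A"
```

For example, under these assumptions an item with MH D-DIF = -1.6 and a standard error of 0.6 exceeds the 1.5 floor but is not significantly greater than 1 in absolute value, so it is labeled B rather than C.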

Because most of the ET estimates of MH D-DIF and SE(MH D-DIF) statistics are 
based on at least 10,000 observations, it is reasonable to assume that they provide precise 
estimates of the population mean and standard deviation of the MH D-DIF statistic for the 
relevant configuration of item properties and simulation conditions. This is supported by the 
analysis described in Appendix C. If it is assumed that the MH D-DIF statistics for this 
configuration are normally distributed with this mean and standard deviation, percentiles of 
the MH D-DIF distribution can be obtained. These percentiles can then be used to estimate 
the percent of times such an item will be classified as an A, B, or C item.^ This is an 
alternative way of providing information about the sampling variation of the MH D-DIF 
statistic. 

^This approach was suggested by Charles Lewis. 

Based on the ETS DIF rules, we developed an algorithm for estimating these percents, 
to be applied separately to each item in each condition (see Appendix D). The algorithm was 
tested and found to work well with data for 15 items from our simulation, using the ET 
estimates of MH D-DIF and SE(MH D-DIF) to approximate the mean and standard deviation 
of the MH D-DIF distribution. Details and results are given in Table C-1 in Appendix C. 
The algorithm was also tested on data from the simulation study of Donoghue, Holland, and 
Thayer (1993) consisting of 100 replications of the MH D-DIF, SE(MH D-DIF), and MH 
chi-square statistics for six different items. For each item, the estimated percents of A, B, 
and C results based on the method of Appendix D (using the average over 100 replications of 
MH D-DIF and SE(MH D-DIF) to estimate the mean and standard deviation of the MH D-DIF 
distribution) matched very closely the actual percents of A, B, and C results in the 100 
replications. 
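
The idea of turning a mean and standard deviation of MH D-DIF into expected percents of A, B, and C results can be sketched as follows. This is not the algorithm of Appendix D, which is not reproduced in this section. It is a simplified version that assumes MH D-DIF is normally distributed, that the estimated standard error in every replication equals the standard deviation of that distribution, and that significance is judged by z tests with the critical values shown. All names are hypothetical.

```python
import math


def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a Normal(mean, sd) variable."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def expected_abc_percents(mean, sd, z_one_sided=1.645, z_two_sided=1.96):
    """Approximate percent of A, B, and C outcomes for one item.

    Assumes MH D-DIF ~ Normal(mean, sd) and that the standard error used
    in each significance test equals sd (a simplification).
    """
    # |MH D-DIF| must reach both the size floor and the significance bound.
    c_cut = max(1.5, 1.0 + z_one_sided * sd)
    # Any value that would qualify for C is counted as C, never as B.
    b_cut = min(max(1.0, z_two_sided * sd), c_cut)

    def prob_abs_at_least(cut):
        upper = 1.0 - normal_cdf(cut, mean, sd)
        lower = normal_cdf(-cut, mean, sd)
        return upper + lower

    p_c = prob_abs_at_least(c_cut)
    p_b = prob_abs_at_least(b_cut) - p_c
    p_a = 1.0 - p_b - p_c
    return {"A": 100.0 * p_a, "B": 100.0 * p_b, "C": 100.0 * p_c}
```

Because the result depends on the standard deviation as well as the mean, the same item can have quite different expected percents of C results under different sample size conditions.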

Tables 13-15 give the average expected percent of C results for each combination of 
ad and b. The first entry in each cell is the average expected percent of C results for the 
n_R = 900, n_F = 100 condition, the second entry is the average expected percent for the 
n_R = 500, n_F = 500 condition, and the third entry is the number of item results contributing 
to the average. 

Insert Tables 13-15 about here. 

Table 13 shows that the percents of C results were close to zero in the null case, as 
desired. Even in the worst case (b = -1.95, n_R = 500, n_F = 500), the average expected 
percent of C results was only 0.2. As noted earlier, an item must have an MH D-DIF with a 
magnitude exceeding 1.5 in order to be a C item. Because MH D-DIF was found to be 
approximately equal to 3ad in the conditions investigated in our study, items with ad = ±.70 
and ad = ±.52 can be regarded as nominal C items. The bottom margin of Table 14 shows 
that with samples of 500 in each group, Pool 2 items with ad = ±.70 would nearly always be 
identified as C items. Those with ad = ±.52 would be expected to be so labeled at least three 
quarters of the time. As anticipated, the power to detect extreme DIF items was substantially 
smaller for the n_R = 900, n_F = 100 sample size condition. Table 15 shows smaller detection 
rates for the nominal C items in Pool 3. As noted, item 75, which has a difficulty of 1.95, 
had a smaller MH D-DIF value than expected; therefore, its average expected percent of C 
results was also smaller. The three items with ad = ±.52 also had considerably smaller 
detection rates than the ad = ±.52 items in Pool 2. 
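
The arithmetic behind the label "nominal C items" can be written out using the approximation MH D-DIF = 3ad and the nonzero magnitudes of ad that occur in Pools 2 and 3 (Table A-8):

```latex
\[
\lvert \text{MH D-DIF} \rvert \approx 3\,\lvert ad \rvert : \qquad
3(0.70) = 2.10, \quad 3(0.52) = 1.56, \quad 3(0.35) = 1.05, \quad 3(0.26) = 0.78 .
\]
```

Only the first two values exceed the magnitude of 1.5 required for a C categorization, which is why items with ad = ±.70 and ad = ±.52 are the ones treated as nominal C items.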

5.2 Results for the pretest items 

For several reasons, results for the pretest items must be interpreted differently from 
the results for the CAT items. First, the pretest items were administered nonadaptively. 
Second, the pretest items were identical for all examinees, regardless of which CAT pool was 
administered. Therefore, the identity of the CAT pool is relevant only because the matching 
variable for the pretest items was a function of the expected true score on the CAT. Third, 
the properties of the pretest items follow a balanced design. Specifically, three levels of b 
were crossed with five levels of d, and a was equal to 1 for all items. 

5.2.1 MH D-DIF and STD P-DIF for pretest items 

To determine how to summarize the results of the pretest items, an ANOVA was 
conducted using the MH D-DIF statistics as the observations. (Because analyses on the CAT 
pool items showed that ANOVAs of MH D-DIF and STD P-DIF led to nearly identical 
results, no ANOVA was conducted for STD P-DIF for the pretest items.) The factors 
included were focal group distribution (F), item pool (P), item difficulty (B), and item DIF 
level (D). All two-factor and three-factor interactions were also assessed. Results were 
somewhat different from those obtained for the CAT items. All effects were statistically 
significant at α = .01 except for the P x D, F x P x D, F x B x D, and P x B x D 
interactions. However, the only effects that explained more than 1% of the variance were D 
(92%) and B x D (3%). Therefore, for simplicity, results were tabled in the same way as the 
CAT item results; that is, results were displayed for each combination of ad = d and b. 

Results for the pretest items are given in Tables 16-20. Although the pretest items 
were not subject to problems of variability in sample sizes across items, we used the ET 
estimation procedure for these items, as for the CAT items. Because the MH D-DIF and 
STD P-DIF statistics were nearly identical across pools, results for these statistics are given 
for Pool 1 only (Tables 16 and 17). Results are shown only for the n_R = 500, n_F = 500 
condition. 

Insert Tables 16-20 about here. 

In all three pools, items without DIF had DIF statistics of about zero, as desired. For 
items with DIF, MH D-DIF, but not STD P-DIF, tended to decrease in absolute value as b 
increased for a fixed value of ad. The DIF statistics for the pretest items tended to be 
slightly smaller than the corresponding statistics for the CAT items. 

As an additional check on the DIF results for the pretest items, the correlation matrix 
for MH D-DIF, STD P-DIF, and ad was obtained within each of the three pools. The three 
correlation matrices were nearly the same. The correlation between the two DIF statistics 
was .95, the correlation between MH D-DIF and ad was .96, and the correlation between 
STD P-DIF and ad was .94 to .95. (Because all the DIF statistics for pretest items were based 
on the ET method, reliabilities were close to unity. Therefore, the corrected correlations 
obtained using equation 14 were almost identical to the uncorrected correlations.) 

5.2.2 SE(MH D-DIF) and SE(STD P-DIF) for pretest items 

The most pronounced difference between the pretest and CAT item results was the 
size of the standard error of MH D-DIF. While the values of SE(STD P-DIF) x 10 for pretest 
items were, on the average, slightly smaller than those for CAT items (0.2 to 0.3 for the 
n_R = 500, n_F = 500 condition and 0.3 to 0.5 for the n_R = 900, n_F = 100 condition, 
compared to fairly consistent values of 0.3 and 0.5, respectively, for the two sample size 
conditions in the CAT), the standard errors of MH D-DIF tended to be considerably larger for 
the pretest items than for the CAT items (ranging from about 0.4 to 0.6 for the n_R = 500, 
n_F = 500 condition and from 0.6 to 1.1 for the n_R = 900, n_F = 100 condition, compared to 
0.3 to 0.4 and 0.5 to 0.7, respectively, for the two sample size conditions in the CAT). 

There are several factors that may have contributed to the larger standard errors. First, 
they may be related to the larger group differences in ability distributions in the pretest 
compared to the CAT. The pretest items are administered to all examinees, whereas the CAT 
is administered to those within a relatively narrow range of ability. Therefore, the pretest 
data are distributed across more levels of the matching variable, and this greater sparseness 
may lead to inflation of the standard errors (see the related discussion in section 5.1.1). A 
second possible reason for the larger standard errors is the definition of the matching variable 
in the pretest analyses. The estimated standard error of MH D-DIF has been found to be 
inflated by inclusion of the studied item in the matching variable (Donoghue, Holland, & 
Thayer, 1993). Therefore, it is possible that the larger estimated standard errors in the pretest 
analyses resulted from the nonstandard method of including the studied item (i.e., adding the 
studied item score to an expected true score based on the remaining items). A third possible 
contributing factor is the relation between item difficulty and SE(MH D-DIF). This 
phenomenon was also investigated by Donoghue, Holland, and Thayer (1993), who studied 
the behavior of the ratio of the average SE(MH D-DIF) over 100 replications to the standard 
deviation of MH D-DIF over replications. They found that this ratio was larger for items 
with difficulty (b) parameters of -.5 and +.5 than for those with b = 0. Our examination of 
their data revealed that the average SE(MH D-DIF), like the ratio, was larger for the lower 
and higher difficulty levels than for the middle difficulty level. The standard deviation of 
MH D-DIF had the opposite pattern: It was smaller for b = -.5 and b = +.5 than for b = 0. 
In investigating the pretest items that had unusually large standard errors, we found that these 
items had very large percents correct (above 85). In every simulation condition, items 4 and 
5 had the largest values of SE(MH D-DIF). These were the easiest of the pretest items, with 
b = -1.3 and d = .35 and .70, respectively. A detailed examination of pretest items for 
examinees in Condition 17 showed that the Spearman correlation between item percent 
correct and SE(MH D-DIF) was .88. In other conditions, such as Condition 5, the relation 
took a curvilinear form, which is consistent with the findings of Donoghue, Holland, and 
Thayer (1993). In general, whether the relation was curvilinear or monotonic, the items with 
the highest percents correct tended to have the highest values of SE(MH D-DIF). Because the 
CAT items were administered to examinees with a narrower range of ability, they rarely had 
percents correct over 75, which may, in part, explain their smaller standard errors. 

5.2.3 Expected percent of C results for pretest items 

The average expected percent of C results, given in Tables 18-20 for the three pools, 
was, of course, affected by the larger values of SE(MH D-DIF) for the pretest items, as well 
as the slight tendency of the DIF measures themselves to be smaller in the pretest 
than in the CAT. Results were quite similar for Pools 1 and 2, but were somewhat different 
for Pool 3 because of a slightly different pattern of standard errors for that pool. Given a 
particular value of ad and b, an item was more likely to be labeled a C item if it was a CAT 
item than if it was a pretest item. Consider a Pool 3 item with b = 1.3 and ad = .70. A CAT 
item with these properties would be expected to have a C label 92.4% of the time in the 
n_R = 500, n_F = 500 condition and 54.7% of the time in the n_R = 900, n_F = 100 condition. 
The corresponding percents for a pretest item were 45.9% and 24.9%. 

5.3 Results of examinee ability estimation 

In addition to determining DIF results for groups of items, it is useful to examine the 
accuracy of estimation of examinee ability under various conditions. Table 21 gives, for each 
item pool and population group, the median and interquartile range of the residual obtained 
by subtracting the true ability used in data generation from the CAT-based ability estimate 
(θ̂_CAT). Table 22 provides the same information for the ability estimate based on responses 
to all 75 items (θ̂_75). Because ability estimates for examinees with infinite MLEs have been 
set to ±10, means and standard deviations would be misleading. Each cell of these tables is 
based on 1,000 examinees. The standard errors of the medians are about .02 for the CAT 
ability residuals (Table 21) and about .01 for the 75-item ability residuals (Table 22). 
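
The reason for reporting medians and interquartile ranges can be seen in a short sketch. It is an illustration only: the function name and the toy data are hypothetical, and the quartiles are computed with the default method of Python's `statistics.quantiles`, which need not match the method used in the study.

```python
import math
import statistics


def summarize_residuals(estimates, true_values, bound=10.0):
    """Return (median, interquartile range, mean) of ability residuals.

    Infinite ability estimates are replaced by +bound or -bound before the
    residual (estimate minus true ability) is computed.
    """
    residuals = []
    for est, true in zip(estimates, true_values):
        if math.isinf(est):
            est = math.copysign(bound, est)
        residuals.append(est - true)
    q1, _, q3 = statistics.quantiles(residuals, n=4)
    return statistics.median(residuals), q3 - q1, statistics.mean(residuals)


if __name__ == "__main__":
    # Nine well-estimated examinees and one with an infinite MLE.
    true_abilities = [0.0] * 9 + [1.0]
    ability_estimates = [-0.3, -0.2, -0.1, -0.1, 0.0, 0.0, 0.1, 0.1, 0.2,
                         math.inf]
    print(summarize_residuals(ability_estimates, true_abilities))
```

In this toy example a single examinee whose estimate was set to +10 pulls the mean residual far from zero, while the median and interquartile range are essentially unaffected.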

Insert Tables 21-22 about here. 

The most striking finding in Tables 21-22 is that all the median residuals were 
negative. This appears to be the result of estimation bias due to the use of estimated, rather 
than true, item parameters. A CAT simulation based on the true item parameters showed that 
use of our particular set of item parameter estimates led to a downward bias in the ability 
estimates for reference group examinees. In general, however, the size of the bias is not 
easily characterized. The bias depends on the location of the population distribution and on 
the presence of DIF; therefore, it is not constant across the cells of Tables 21 and 22. 

Another finding was that, as expected, the median residuals were nearly always closer 
to zero for θ̂_75. Only in the Pool 3, focal N(+.5, 1) cell was the median residual for θ̂_CAT 
slightly smaller in absolute value than the corresponding value for θ̂_75. The item-level DIF 
results, however, suggest that the slightly better ability estimation achieved by using all 75 
item responses did not substantially improve the behavior of the DIF statistics. Certain other 
results are difficult to interpret. For example, estimation appears to have been better in Pool 
2 than in Pool 1, which is surprising, given that Pool 1 is free of DIF. 

Estimation was worst in Pool 3, particularly for the focal N(-1, 1) group, where ability 
was underestimated by an average of nearly one tenth of a standard deviation (of true ability) 
in the CAT and by about one sixth of a standard deviation on the nonadaptive test. For the 
CAT, this is consistent with predictions, because, in Pool 3, the lowest-ability focal group 
gets the easiest items, which tend to have more negative DIF. The median residual was 
closer to zero for the N(0, 1) population and still closer for the N(+.5, 1) group. It is 
interesting that in the nonadaptive administration, the pattern of median residuals for Pool 3 
paralleled the pattern observed for the CAT: The N(-1, 1) focal group again had the largest 
median residual, followed in order by the N(0, 1) and N(+.5, 1) groups. The explanation may 
be that, even though all examinees receive all items in the nonadaptive administration, the 
most informative items are those that have difficulties close to the examinee's ability level. 
In the N(-1, 1) focal group, for example, the items that contribute the most to an examinee's 
score are the easier items, which, in Pool 3, are more likely to have negative DIF. Other 
factors, such as the differential biases due to item parameter estimation, may have also 
contributed to the large Pool 3 residuals. 

6. Summary and discussion 

Our study was based on modified versions of the MH D-DIF and STD P-DIF statistics 
for both computer-adaptive test items and nonadaptively administered "pretest" items. We 
eliminated from consideration a proposed DIF method based on comparison of item percents 
correct for examinees who received the items late in the CAT. A preliminary simulation 
showed that this method did not lead to adequate matching of examinees. 

Our findings, in general, appear to provide good news for testing programs that wish 
to establish DIF screening procedures for adaptively administered items. The CAT-based DIF 
statistics were found to be highly correlated with true DIF and with DIF measures based on 
nonadaptive administration. The mean DIF statistics for each pool were close to their 
nominal value of zero, although the CAT-based statistics showed a slight inflation, 
particularly for Pool 3, in which DIF and difficulty were positively correlated. In general, 
Pool 3 DIF statistics were not quite as well-behaved as the Pool 2 statistics. The values of 
the DIF statistics for DIFless items in Pool 3 were not as close to zero and the detection rate 
for nominal C items was lower. 

The factors that affected the size of the MH D-DIF and STD P-DIF statistics, in 
general, were the size of the true DIF, the item difficulty, and the interactions of item 
difficulty and item discrimination, respectively, with the true DIF. Focal group distribution, 
item position, and sample size conditions had almost no effect. 

A finding that was useful, although not directly relevant to CATs, was that in 
nonadaptive administration of 75 items, matching on the expected true score based on the 
MLE of ability led to essentially the same results as matching on number-right score. The 
similarity between these approaches, however, may be substantially less for shorter tests. 

In most cases, examinee residual abilities for both the CAT and the 75-item 
nonadaptive test had medians close to zero within a population group and pool. The major 
exception was the focal N(-1, 1) group that received Pool 3 items; these examinees had a 
median residual of about -.1 for the CAT and -.06 for the nonadaptive test.^10 (The standard 
deviation of true ability was unity in each population group.) The differences between 
median residuals for the CAT and those for the nonadaptive test were not large relative to 
their standard errors; therefore, our findings did not support the conjecture that CATs might 
be more disadvantageous than nonadaptive tests for lower-achieving groups when DIF and 
difficulty were positively correlated. 

Like the DIF statistics for CAT items, the pretest DIF statistics were well-behaved and 
had high correlations with true DIF. The pretest DIF statistics tended to be very slightly 
smaller in magnitude than the DIF values for CAT items with the same item parameters. A 
more striking difference between the CAT and pretest results was that the standard errors of 
the MH D-DIF statistics tended to be larger for the pretest items than for the CAT items, 
further reducing the power to detect DIF. Possible reasons for the larger standard errors 
include the different method of constructing the matching variable, the greater sparseness of 
the data, and the occurrence of items with larger percents correct than in the CAT. Previous 
research has shown that all of these features can affect the size of SE(MH D-DIF). 

^10 As noted earlier, both the DIF statistics and the examinee residuals tended to show 
larger departures from their target values in Pool 3, in which DIF and difficulty were 
positively correlated, than in Pool 2, in which they were uncorrelated. The interpretation of 
this result is not clear-cut, however. The Pool 3 data set was created because of the empirical 
finding that DIF estimates are sometimes positively correlated with item difficulty estimates. 
This does not imply, however, that the appropriate data-generating model is one in which the 
true (and ordinarily unknown) DIF and difficulty parameters are correlated. In short, there is 
no solid evidence for determining whether the Pool 2 or the Pool 3 data-generating model is 
more realistic. 

There are many questions that our study did not address. For example, because we 
used constant sample sizes in our simulation, we did not address the problem of insufficient 
item data that may arise when conducting DIF analyses of adaptively administered items 
(Miller, 1992). We did not consider methods for refining the DIF criterion by deleting DIF 
items and repeating the analysis, nor did we evaluate the effects of using different procedures, 
such as Bayesian methods, for estimating abilities or item parameters. Our conclusions apply 
to the case in which the data generation model and the estimation model are both based on 
the 3PL function. Also, our results may depend on our use of the expected true score for the 
item pool as our matching variable. Other types of scores may be of interest. For example, 
in some actual CATs, an expected true score is computed for a set of reference items that are 
not, in fact, included in the item pool. We did not consider CAT algorithms in which item 
selection is not determined solely by information, but is constrained by requirements 
concerning item type and content, nor did we examine the effects of complex starting 
algorithms used in some CATs to control the "exposure" of items. Finally, our study could 
not, of course, provide any data on the appropriateness of using item parameter estimates 
obtained through paper-and-pencil or nonadaptive computer administration to estimate item 
information and examinee ability in a CAT setting. If administration mode (see Hetter, 
Segall, & Bloxom, 1992; Wainer & Mislevy, 1990) or item order and context (see Zwick, 
1991) affect the functioning of items, CAT-based ability estimation and hence DIF 
estimation will be impaired. 

6.1 Opportunities for further research and applications 

The data files we have created will facilitate further research on CATs at a relatively 
low cost. First, since we generated responses to all items in all three pools, we can create 
new CATs for the examinees without repeating the step of generating examinee abilities and 
item responses. Second, with the 2 x 2 x K tables of probabilities we generated for each of 
the 3 (pools) x 71 (administered items per pool) = 213 CAT items, as well as for the pretest 
items, we can create expected tables for any target sample sizes and compute DIF statistics on 
these tables without further data generation. An additional application of our work may 
involve the expected percent of A, B, and C results that can be computed according to 
Appendix D. It is likely that this method could be successfully applied to any large data set, 
such as the SAT. For example, the method could be used to predict the likelihood that an 
item would be categorized as a C item for various combinations of reference and focal group 
sample sizes. Viewing an item's DIF status as probabilistic, rather than deterministic, may be 
a fruitful way of evaluating DIF results. 
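
The step of turning a 2 x 2 x K table of probabilities into expected tables for chosen sample sizes, and then into DIF statistics, can be sketched as below. This is not the expected table estimator of Appendix B, which is not reproduced in this section. The layout of the probability tables and all function names are assumptions; the formulas are the standard Mantel-Haenszel common odds ratio (with MH D-DIF = -2.35 times its natural log) and the focal-group-weighted standardized difference in proportions correct.

```python
import math


def expected_tables(ref_probs, foc_probs, n_ref, n_foc):
    """Scale per-level joint probabilities to expected counts.

    ref_probs and foc_probs are lists of (p_right, p_wrong) pairs, one per
    level k of the matching variable, summing to 1 within each group
    (an assumed layout). Each returned tuple is
    (ref right, ref wrong, focal right, focal wrong).
    """
    return [(n_ref * rr, n_ref * rw, n_foc * fr, n_foc * fw)
            for (rr, rw), (fr, fw) in zip(ref_probs, foc_probs)]


def mh_d_dif(tables):
    """MH D-DIF = -2.35 * ln(Mantel-Haenszel common odds ratio)."""
    num = den = 0.0
    for a, b, c, d in tables:
        total = a + b + c + d
        if total > 0:
            num += a * d / total
            den += b * c / total
    return -2.35 * math.log(num / den)


def std_p_dif(tables):
    """Focal-weighted mean of (focal minus reference) proportion correct."""
    weighted = focal_total = 0.0
    for a, b, c, d in tables:
        n_ref_k, n_foc_k = a + b, c + d
        if n_ref_k > 0 and n_foc_k > 0:
            weighted += n_foc_k * (c / n_foc_k - a / n_ref_k)
            focal_total += n_foc_k
    return weighted / focal_total
```

Under this layout the focal weights scale with the focal sample size and cancel, so the STD P-DIF value computed from expected tables does not change with the target sample sizes, in line with the remark in section 5.1.2.2.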

Table A-1 

Item Information Table* 

[Table body: 25 rows (item positions 1-25) by 21 columns (ability levels from -2.0 to 2.0 
in steps of 0.2); each cell gives an item number. The cell entries are not legible in the 
scanned source.] 

*For each ability level, the table lists the 25 most informative items, starting with the most informative. 

Table A-5 

Frequency of Item Parameter Combinations in Pool 1 (75 Items) 

                  ln a 

    b        -0.30     0.00 

  -1.95        5         2 
  -1.30        6         4 
  -0.65        7         6 
   0.00        7         7 
   0.65        6         7 
   1.30        5         6 
   1.95        2         5 

Table A-6 

Frequency of Item Parameter Combinations for Pool 2 (75 Items) 

ln a = -0.30                        d 

    b        -0.50   -0.25    0.0     0.25    0.50    Marginal 

  -1.95        0       2       2       1       0         5 
  -1.30        0       2       2       2       0         6 
  -0.65        0       2       3       2       0         7 
   0.00        1       1       2       2       1         7 
   0.65        0       1       2       2       1         6 
   1.30        1       1       2       1       0         5 
   1.95        0       0       1       1       0         2 

Marginal       2       9      14      11       2        38 

ln a = 0.00                         d 

    b        -0.50   -0.25    0.0     0.25    0.50    Marginal 

  -1.95        0       1       1       0       0         2 
  -1.30        0       1       2       1       0         4 
  -0.65        0       2       2       1       1         6 
   0.00        1       2       2       1       1         7 
   0.65        0       2       3       2       0         7 
   1.30        0       2       2       2       0         6 
   1.95        0       1       2       2       0         5 

Marginal       1      11      14       9       2        37 

Table A-7 

Frequency of Item Parameter Combinations in Pool 3 (75 Items) 

ln a = -0.30                        d 

    b        -0.50   -0.25    0.0     0.25    0.50    Marginal 

  -1.95        1       2       1       1       0         5 
  -1.30        1       2       2       1       0         6 
  -0.65        0       2       3       2       0         7 
   0.00        0       2       3       2       0         7 
   0.65        0       1       2       2       1         6 
   1.30        0       1       2       1       1         5 
   1.95        0       0       1       1       0         2 

Marginal       2      10      14      10       2        38 

ln a = 0.00                         d 

    b        -0.50   -0.25    0.0     0.25    0.50    Marginal 

  -1.95        0       1       1       0       0         2 
  -1.30        0       1       2       1       0         4 
  -0.65        1       2       2       1       0         6 
   0.00        0       2       3       2       0         7 
   0.65        0       2       3       2       0         7 
   1.30        0       1       2       2       1         6 
   1.95        0       1       1       2       1         5 

Marginal       1      10      14      10       2        37 
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Table A-8 

a, b, and d Parameters in Pool 2 and Pool 3 





        Pools 1, 2, & 3             Pool 2                      Pool 3

Item      a       b        b-d      d       ad         b-d      d       ad

  1     0.74   -1.95     -1.60   -0.35   -0.26       -1.25   -0.70   -0.52
  2     0.74   -1.95     -1.60   -0.35   -0.26       -1.60   -0.35   -0.26
  3     0.74   -1.95     -1.95    0.00    0.00       -1.60   -0.35   -0.26
  4     0.74   -1.95     -1.95    0.00    0.00       -1.95    0.00    0.00
  5     0.74   -1.95     -2.30    0.35    0.26       -2.30    0.35    0.26
  6     0.74   -1.30     -0.95   -0.35   -0.26       -0.60   -0.70   -0.52
  7     0.74   -1.30     -0.95   -0.35   -0.26       -0.95   -0.35   -0.26
  8     0.74   -1.30     -1.30    0.00    0.00       -0.95   -0.35   -0.26
  9     0.74   -1.30     -1.30    0.00    0.00       -1.30    0.00    0.00
 10     0.74   -1.30     -1.65    0.35    0.26       -1.30    0.00    0.00
 11     0.74   -1.30     -1.65    0.35    0.26       -1.65    0.35    0.26
 12     0.74   -0.65     -0.30   -0.35   -0.26       -0.30   -0.35   -0.26
 13     0.74   -0.65     -0.30   -0.35   -0.26       -0.30   -0.35   -0.26
 14     0.74   -0.65     -0.65    0.00    0.00       -0.65    0.00    0.00
 15     0.74   -0.65     -0.65    0.00    0.00       -0.65    0.00    0.00
 16     0.74   -0.65     -0.65    0.00    0.00       -0.65    0.00    0.00
 17     0.74   -0.65     -1.00    0.35    0.26       -1.00    0.35    0.26
 18     0.74   -0.65     -1.00    0.35    0.26       -1.00    0.35    0.26
 19     0.74    0.00      0.70   -0.70   -0.52        0.35   -0.35   -0.26
 20     0.74    0.00      0.35   -0.35   -0.26        0.35   -0.35   -0.26
 21     0.74    0.00      0.00    0.00    0.00        0.00    0.00    0.00
 22     0.74    0.00      0.00    0.00    0.00        0.00    0.00    0.00
 23     0.74    0.00     -0.35    0.35    0.26        0.00    0.00    0.00
 24     0.74    0.00     -0.35    0.35    0.26       -0.35    0.35    0.26
 25     0.74    0.00     -0.70    0.70    0.52       -0.35    0.35    0.26
 26     0.74    0.65      1.00   -0.35   -0.26        1.00   -0.35   -0.26
 27     0.74    0.65      0.65    0.00    0.00        0.65    0.00    0.00
 28     0.74    0.65      0.65    0.00    0.00        0.65    0.00    0.00
 29     0.74    0.65      0.30    0.35    0.26        0.30    0.35    0.26
 30     0.74    0.65      0.30    0.35    0.26        0.30    0.35    0.26
 31     0.74    0.65     -0.05    0.70    0.52       -0.05    0.70    0.52
 32     0.74    1.30      2.00   -0.70   -0.52        1.65   -0.35   -0.26
 33     0.74    1.30      1.65   -0.35   -0.26        1.30    0.00    0.00
 34     0.74    1.30      1.30    0.00    0.00        1.30    0.00    0.00
 35     0.74    1.30      1.30    0.00    0.00        0.95    0.35    0.26
 36     0.74    1.30      0.95    0.35    0.26        0.60    0.70    0.52
 37     0.74    1.95      1.95    0.00    0.00        1.95    0.00    0.00
 38     0.74    1.95      1.60    0.35    0.26        1.60    0.35    0.26
 39     1.00   -1.95     -1.60   -0.35   -0.35       -1.60   -0.35   -0.35
 40     1.00   -1.95     -1.95    0.00    0.00       -1.95    0.00    0.00
 41     1.00   -1.30     -0.95   -0.35   -0.35       -0.95   -0.35   -0.35
 42     1.00   -1.30     -1.30    0.00    0.00       -1.30    0.00    0.00
 43     1.00   -1.30     -1.30    0.00    0.00       -1.30    0.00    0.00
 44     1.00   -1.30     -1.65    0.35    0.35       -1.65    0.35    0.35
 45     1.00   -0.65     -0.30   -0.35   -0.35        0.05   -0.70   -0.70
 46     1.00   -0.65     -0.30   -0.35   -0.35       -0.30   -0.35   -0.35
 47     1.00   -0.65     -0.65    0.00    0.00       -0.30   -0.35   -0.35
 48     1.00   -0.65     -0.65    0.00    0.00       -0.65    0.00    0.00
 49     1.00   -0.65     -1.00    0.35    0.35       -0.65    0.00    0.00
 50     1.00   -0.65     -1.35    0.70    0.70       -1.00    0.35    0.35
 51     1.00    0.00      0.70   -0.70   -0.70        0.35   -0.35   -0.35
 52     1.00    0.00      0.35   -0.35   -0.35        0.35   -0.35   -0.35
 53     1.00    0.00      0.35   -0.35   -0.35        0.00    0.00    0.00
 54     1.00    0.00      0.00    0.00    0.00        0.00    0.00    0.00
 55     1.00    0.00      0.00    0.00    0.00        0.00    0.00    0.00
 56     1.00    0.00     -0.35    0.35    0.35       -0.35    0.35    0.35
 57     1.00    0.00     -0.70    0.70    0.70       -0.35    0.35    0.35
 58     1.00    0.65      1.00   -0.35   -0.35        1.00   -0.35   -0.35
 59     1.00    0.65      1.00   -0.35   -0.35        1.00   -0.35   -0.35
 60     1.00    0.65      0.65    0.00    0.00        0.65    0.00    0.00
 61     1.00    0.65      0.65    0.00    0.00        0.65    0.00    0.00
 62     1.00    0.65      0.65    0.00    0.00        0.65    0.00    0.00
 63     1.00    0.65      0.30    0.35    0.35        0.30    0.35    0.35
 64     1.00    0.65      0.30    0.35    0.35        0.30    0.35    0.35
 65     1.00    1.30      1.65   -0.35   -0.35        1.65   -0.35   -0.35
 66     1.00    1.30      1.65   -0.35   -0.35        1.30    0.00    0.00
 67     1.00    1.30      1.30    0.00    0.00        1.30    0.00    0.00
 68     1.00    1.30      1.30    0.00    0.00        0.95    0.35    0.35
 69     1.00    1.30      0.95    0.35    0.35        0.95    0.35    0.35
 70     1.00    1.30      0.95    0.35    0.35        0.60    0.70    0.70
 71     1.00    1.95      2.30   -0.35   -0.35        2.30   -0.35   -0.35
 72     1.00    1.95      1.95    0.00    0.00        1.95    0.00    0.00
 73     1.00    1.95      1.95    0.00    0.00        1.60    0.35    0.35
 74     1.00    1.95      1.60    0.35    0.35        1.60    0.35    0.35
 75     1.00    1.95      1.60    0.35    0.35        1.25    0.70    0.70

Table A-9
Pretest Item Parameters

Item      a       b       b-d      d       c

  1     1.00   -1.30    -0.60   -0.70    0.15
  2     1.00   -1.30    -0.95   -0.35    0.15
  3     1.00   -1.30    -1.30    0.00    0.15
  4     1.00   -1.30    -1.65    0.35    0.15
  5     1.00   -1.30    -2.00    0.70    0.15
  6     1.00    0.00     0.70   -0.70    0.15
  7     1.00    0.00     0.35   -0.35    0.15
  8     1.00    0.00     0.00    0.00    0.15
  9     1.00    0.00    -0.35    0.35    0.15
 10     1.00    0.00    -0.70    0.70    0.15
 11     1.00    1.30     2.00   -0.70    0.15
 12     1.00    1.30     1.65   -0.35    0.15
 13     1.00    1.30     1.30    0.00    0.15
 14     1.00    1.30     0.95    0.35    0.15
 15     1.00    1.30     0.60    0.70    0.15
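
The columns of Tables A-8 and A-9 are linked by simple arithmetic: the tabled b - d is the difference of the b and d columns, and ad is the product of a and d. The sketch below recomputes those derived values and then evaluates a three-parameter logistic response probability at the two difficulties. It is only an illustration under stated assumptions: the function names are hypothetical, the scaling constant D = 1.7 is assumed (this section does not state it), and reading b as the reference-group difficulty and b - d as the focal-group difficulty is an interpretation of the column labels.

```python
import math

D = 1.7  # assumed logistic scaling constant; not given in this section


def derived_columns(a, b, d):
    """Return (b - d, a * d), each rounded to two decimals as tabled."""
    return round(b - d, 2), round(a * d, 2)


def p_correct(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


# Item 1, Pool 2 (Table A-8): a = 0.74, b = -1.95, d = -0.35
b_minus_d, ad = derived_columns(0.74, -1.95, -0.35)

# Pretest item 1 (Table A-9): a = 1.00, b = -1.30, d = -0.70, c = 0.15
a, b, d, c = 1.00, -1.30, -0.70, 0.15
theta = 0.0                                   # illustrative ability value
p_reference = p_correct(theta, a, b, c)       # difficulty b
p_focal = p_correct(theta, a, b - d, c)       # difficulty b - d (assumed focal value)
print(b_minus_d, ad, round(p_reference, 3), round(p_focal, 3))
```

For pretest item 1 (d = -0.70) the assumed focal-group probability falls below the reference-group probability at the same ability, which is in line with the negative MH D-DIF listed for that item in Table C-1.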



Table A-10 

True and Estimated Item Parameters for Pools 1, 2, and 3 



         True Item Parameters      Estimated Item Parameters

Item       a       b       c          a       b       c

  1      0.74   -1.95    0.15       0.73   -2.06    0.14
  2      0.74   -1.95    0.15       0.75   -1.91    0.14
  3      0.74   -1.95    0.15       0.64   -2.31    0.14
  4      0.74   -1.95    0.15       0.64   -2.18    0.14
  5      0.74   -1.95    0.15       0.73   -2.08    0.14
  6      0.74   -1.30    0.15       0.70   -1.45    0.14
  7      0.74   -1.30    0.15       0.69   -1.53    0.14
  8      0.74   -1.30    0.15       0.80   -1.25    0.14
  9      0.74   -1.30    0.15       0.72   -1.36    0.14
 10      0.74   -1.30    0.15       0.68   -1.35    0.14
 11      0.74   -1.30    0.15       0.68   -1.48    0.14
 12      0.74   -0.65    0.15       0.73   -0.69    0.14
 13      0.74   -0.65    0.15       0.81   -0.47    0.21
 14      0.74   -0.65    0.15       0.75   -0.81    0.14
 15      0.74   -0.65    0.15       0.79   -0.66    0.14
 16      0.74   -0.65    0.15       0.73   -0.73    0.14
 17      0.74   -0.65    0.15       0.82   -0.63    0.14
 18      0.74   -0.65    0.15       0.69   -0.76    0.14
 19      0.74    0.00    0.15       0.79    0.07    0.18
 20      0.74    0.00    0.15       0.67   -0.23    0.07
 21      0.74    0.00    0.15       0.68   -0.06    0.12
 22      0.74    0.00    0.15       0.83   -0.07    0.12
 23      0.74    0.00    0.15       0.92    0.18    0.23
 24      0.74    0.00    0.15       0.91    0.11    0.19
 25      0.74    0.00    0.15       0.78   -0.07    0.13
 26      0.74    0.65    0.15       0.89    0.62    0.15
 27      0.74    0.65    0.15       0.73    0.56    0.12
 28      0.74    0.65    0.15       0.93    0.64    0.18
 29      0.74    0.65    0.15       1.08    0.74    0.22
 30      0.74    0.65    0.15       0.61    0.47    0.06
 31      0.74    0.65    0.15       0.59    0.42    0.04
 32      0.74    1.30    0.15       0.83    1.41    0.19
 33      0.74    1.30    0.15       0.69    1.22    0.14
 34      0.74    1.30    0.15       0.66    1.31    0.10
 35      0.74    1.30    0.15       0.79    1.22    0.16
 36      0.74    1.30    0.15       0.75    1.29    0.15
 37      0.74    1.95    0.15       0.54    2.28    0.13
 38      0.74    1.95    0.15       0.6     1.94    0.13
 39      1.00   -1.95    0.15       0.97   -2.02    0.14
 40      1.00   -1.95    0.15       1.01   -2.01    0.14
 41      1.00   -1.30    0.15       1.03   -1.37    0.14
 42      1.00   -1.30    0.15       1.00   -1.44    0.14
 43      1.00   -1.30    0.15       1.07   -1.30    0.14
 44      1.00   -1.30    0.15       1.00   -1.40    0.14
 45      1.00   -0.65    0.15       1.11   -0.58    0.17
 46      1.00   -0.65    0.15       1.05   -0.69    0.14
 47      1.00   -0.65    0.15       1.07   -0.59    0.22
 48      1.00   -0.65    0.15       1.10   -0.55    0.20
 49      1.00   -0.65    0.15       1.00   -0.70    0.12
 50      1.00   -0.65    0.15       0.97   -0.83    0.11
 51      1.00    0.00    0.15       0.92   -0.15    0.10
 52      1.00    0.00    0.15       1.20    0.04    0.19
 53      1.00    0.00    0.15       1.02   -0.03    0.14
 54      1.00    0.00    0.15       0.98   -0.08    0.09
 55      1.00    0.00    0.15       0.92   -0.20    0.05
 56      1.00    0.00    0.15       1.20    0.11    0.19
 57      1.00    0.00    0.15       0.86   -0.16    0.10
 58      1.00    0.65    0.15       0.82    0.52    0.10
 59      1.00    0.65    0.15       0.90    0.65    0.12
 60      1.00    0.65    0.15       1.13    0.69    0.16
 61      1.00    0.65    0.15       0.95    0.59    0.12
 62      1.00    0.65    0.15       1.02    0.64    0.15
 63      1.00    0.65    0.15       0.86    0.49    0.09
 64      1.00    0.65    0.15       1.05    0.60    0.16
 65      1.00    1.30    0.15       1.26    1.20    0.15
 66      1.00    1.30    0.15       1.33    1.33    0.19
 67      1.00    1.30    0.15       1.27    1.38    0.19
 68      1.00    1.30    0.15       1.15    1.17    0.14
 69      1.00    1.30    0.15       1.42    1.23    0.18
 70      1.00    1.30    0.15       0.87    1.44    0.15
 71      1.00    1.95    0.15       1.06    1.81    0.12
 72      1.00    1.95    0.15       0.89    2.04    0.16
 73      1.00    1.95    0.15       1.17    1.84    0.16
 74      1.00    1.95    0.15       1.53    1.81    0.17
 75      1.00    1.95    0.15       1.09    1.92    0.16
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Appendix B 
Variance of the ET Estimator of MH D-DIF*

*This variance was derived by Charles Lewis based on the work of Phillips and Holland
(1987).

Let \(\hat{\alpha}'_{MH}\) be the MH odds ratio computed on the adjusted table frequencies, and MH D-DIF'
be the ET estimate of MH D-DIF based on the target sample sizes \(n_R\) and \(n_F\). Then,
based on the results in equations 4-6,

\[
SE(\text{MH D-DIF}') = 2.35 \sqrt{\mathrm{Var}\big(\ln \hat{\alpha}'_{MH}\big)},
\]

where \(\mathrm{Var}\big(\ln \hat{\alpha}'_{MH}\big)\) is estimated by



./—•2 



2(z A, djt:J 



where 



V; = (A, + D,) + (5, + C,) 



and Tt - 
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Appendix C 


Investigation of the ET Estimation Procedure 
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To check on the validity of the ET estimation procedure, we compared the results 
obtained using the ET method to those that we would have obtained using a more standard 
simulation procedure. We wanted to compare estimation procedures in the worst possible 
case; therefore, we chose Condition 5, which has the less stable sample size condition (n_R = 
900, n_F = 100), the largest between-group ability difference (the focal N(-1, 1) population), 
and the most complex DIF structure (Pool 3). It would have been extremely expensive to 
conduct this validity check with the CAT data, in which each record includes different subsets 
of items. Therefore, we used the data from the 15 pretest items. (Note that the identity of 
the item pool was therefore relevant only to the matching variable; the DIF structure within 
the pretest items was the same in all conditions; see section 2.3.5). Because we had already 
generated data for 60,000 examinees in each group, we could use existing data to create 66 
independent replications of the DIF analysis. (This number is the result of dividing 60,000 by 
900 and then rounding down to the next lowest integer.) Table C-1 gives a comparison of the 
MH D-DIF results obtained from the 66 replications to the ET estimates reported in our 
study. The STD P-DIF findings yielded a similar picture of the agreement between the two 
estimation procedures. It is important to keep in mind that the two sets of results in Table 
C-1 are alternative estimates of unknown parameters; neither set can be regarded as the 
criterion. The main findings were as follows. 

1. The values of the ET estimate of MH D-DIF were very close to the values obtained by 
averaging over 66 replications. The standard error of the difference between the ET estimate 
and the average over replications can be estimated using the standard errors from each 
replication and the standard errors of the ET estimate (Appendix B). The standard errors of 
the averages over 66 replications were about .08 for these items; the standard errors of the ET 
estimates were about .04. This yielded values of about .09 for the standard errors of the 
difference between these two estimates. Only five of the differences between the two sets of 
estimates were found to be greater than .09 in magnitude. This is consistent with what would 
be expected if the differences were normally distributed, with a mean of zero. It is interesting 
that the ET approach yielded a more precise estimate of MH D-DIF than the average over 66 
replications. In fact, for the Condition 5 pretest items, about 316 replications would have 
been required to match the precision of the ET estimate. 

2. The distribution of MH D-DIF across items was examined. Based on the ET estimates, 
the mean and standard deviation were .17 and 1.27, respectively. Based on the mean MH D- 
DIF across replications, the across-item mean and standard deviation were found to be .12 
and 1.32. The correlation across items between the two estimates of MH D-DIF was .997. 

3. Estimates of the standard error of MH D-DIF were also considered. Here, three estimates 
were available: the ET estimate SE(MH D-DIF'), the average of SE(MH D-DIF) over 
replications, and the observed standard deviation of MH D-DIF across replications. 
Differences among the estimates were very small. The average SE(MH D-DIF) tended to be 
slightly larger than the standard deviation of MH D-DIF, as found by Donoghue, Holland, and 
Thayer (1993). SE(MH D-DIF') tended to be slightly smaller than the standard deviation. 
The across-item correlation between SE(MH D-DIF') and the average SE(MH D-DIF) was 
.992. Each of these estimates had correlations of about .8 with the standard deviation of MH 
D-DIF. 



4. The estimated proportions of A, B, and C categorizations were examined. As shown in 
Table C-1, agreement between the two estimation methods on the estimated proportion of 
times the item would be labeled a "C," which was our main focus, was satisfactory for most 
items. An exception is item 6, which also had the largest discrepancy between methods in 
the estimated MH D-DIF statistics. 

In summary, the ET method appears to give similar results to those obtained using more 
conventional estimation methods. In our study of the 15 pretest items, the ET estimates were 
much more precise than those that would have been obtained using an affordable number of 
replications. 
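
The arithmetic behind the replication count and the standard error of the difference can be retraced in a few lines. This is a rough sketch using only the rounded figures quoted above (.08 and .04); it assumes the two estimates are treated as independent, and the variable names are illustrative rather than taken from the report.

```python
import math

n_examinees_per_group = 60000
n_reference_per_replication = 900

# Number of independent replications available from the existing data
n_replications = n_examinees_per_group // n_reference_per_replication

se_mean_over_reps = 0.08   # rounded SE of the average over replications
se_et = 0.04               # rounded SE of the ET estimate

# SE of the difference between two estimates treated as independent
se_difference = math.hypot(se_mean_over_reps, se_et)

# Replications needed for the replication average to reach the ET precision,
# since the SE of an average shrinks with the square root of the count
reps_to_match_et = n_replications * (se_mean_over_reps / se_et) ** 2

print(n_replications, round(se_difference, 2), reps_to_match_et)
```

With these rounded inputs the last quantity comes out at 264 rather than the 316 cited above; the report's figure presumably rests on unrounded standard errors.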
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Table C-1 

Comparison of ET Estimates of DIF Statistics to 
Estimates Based on 66 Replications 

                                                    Estimated Percents in
                                                     ETS DIF Categories

Item   Method   MH D-DIF   SE(MH D-DIF)*            A        B        C

  1    ET        -2.12      .61                   12.1     34.8     53.0
       Reps.     -2.12      .65   (.65)            6.7     36.2     57.1
  2    ET        -1.06      .62                   71.2     22.7      6.1
       Reps.     -1.09      .65   (.58)           60.1     33.9      6.1
  3    ET         0.08      .65                   97.0      3.0      0.0
       Reps.     -0.16      .68   (.68)           94.8      5.0      0.2
  4    ET         1.25      .71                   71.2     19.7      9.1
       Reps.      1.28      .75   (.78)           57.7     32.5      9.8
  5    ET         2.55      .79                   10.6     19.7     69.7
       Reps.      2.66      .85   (.79)           10.5     27.5     62.0
  6    ET        -1.46      .63                   30.3     37.9     31.8
       Reps.     -1.72      .67   (.71)           36.1     45.9     17.9
  7    ET        -0.74      .61                   77.3     21.2      1.5
       Reps.     -0.78      .64   (.61)           77.2     20.9      1.9
  8    ET         0.02      .60                  100.0      0.0      0.0
       Reps.      0.01      .62   (.59)           95.0      4.9      0.1
  9    ET         1.01      .60                   71.2     21.2      7.6
       Reps.      1.00      .62   (.63)           60.6     34.3      5.2
 10    ET         1.98      .61                   12.1     42.4     45.5
       Reps.      2.02      .64   (.66)            9.6     41.6     48.7
 11    ET        -0.47      .70                   92.4      7.6      0.0
       Reps.     -0.46      .73   (.78)           89.7      9.5      0.8
 12    ET        -0.20      .68                   93.9      6.1      0.0
       Reps.     -0.43      .73   (.76)           94.0      5.8      0.3
 13    ET         0.10      .67                   95.5      4.5      0.0
       Reps.      0.13      .70   (.65)           94.7      5.1      0.2
 14    ET         0.54      .65                   97.0      3.0      0.0
       Reps.      0.39      .68   (.62)           86.8     12.3      0.9
 15    ET         1.11      .63                   69.7     28.8      1.5
       Reps.      1.06      .66   (.58)           57.9     35.1      7.0

*The lefthand entry in the "Reps." row of this column is the average of the 66 values of SE(MH D-DIF) 
from the replications. The parenthesized value is the standard deviation of the 66 values of MH D-DIF from the 
replications. 
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Appendix D 

Expected Proportions of A, B, and C DIF Results 
Based on ETS Classification Rules 

Suppose that the MH D-DIF statistic, M, has a normal distribution with mean \(\mu\) and 
variance \(\sigma^2\). Then estimates of the proportion of A, B, and C items (PROPA, PROPB, and 
PROPC) can be derived as follows, using the MH D-DIF and SE(MH D-DIF) for all available 
data as estimates of \(\mu\) and \(\sigma\). 

1. First, estimate PROPBC, the proportion of times the item will be a B or C item: 

\[
\begin{aligned}
P(\text{MH chi-square} > 3.84) &= P\big((M/\sigma)^2 > 3.84\big) \\
&= P(M/\sigma > 1.96) + P(M/\sigma < -1.96) \\
&= P\big((M - \mu)/\sigma > 1.96 - \mu/\sigma\big) + P\big((M - \mu)/\sigma < -1.96 - \mu/\sigma\big) \\
&= P(Z > 1.96 - \mu/\sigma) + P(Z < -1.96 - \mu/\sigma),
\end{aligned}
\]

where Z is a standard normal variable. 

\[
\begin{aligned}
P(|M| > 1) &= P(M > 1) + P(M < -1) \\
&= P\big((M - \mu)/\sigma > (1 - \mu)/\sigma\big) + P\big((M - \mu)/\sigma < (-1 - \mu)/\sigma\big) \\
&= P\big(Z > (1 - \mu)/\sigma\big) + P\big(Z < (-1 - \mu)/\sigma\big)
\end{aligned}
\]

\[
\begin{aligned}
K1BC &= \min\big(-1.96 - \mu/\sigma,\; (-1 - \mu)/\sigma\big) \\
K2BC &= \max\big(1.96 - \mu/\sigma,\; (1 - \mu)/\sigma\big) \\
PROPBC &= P(Z > K2BC) + P(Z < K1BC).
\end{aligned}
\]

2. Next, estimate PROPC, the proportion of times the item will be a C item: 

\[
\begin{aligned}
P(|M| > 1.65\sigma + 1) &= P(M > 1.65\sigma + 1) + P(M < -1.65\sigma - 1) \\
&= P\big((M - \mu)/\sigma > 1.65 + (1 - \mu)/\sigma\big) + P\big((M - \mu)/\sigma < -1.65 - (1 + \mu)/\sigma\big) \\
&= P\big(Z > 1.65 + (1 - \mu)/\sigma\big) + P\big(Z < -1.65 - (1 + \mu)/\sigma\big)
\end{aligned}
\]

\[
\begin{aligned}
P(|M| > 1.5) &= P(M > 1.5) + P(M < -1.5) \\
&= P\big((M - \mu)/\sigma > (1.5 - \mu)/\sigma\big) + P\big((M - \mu)/\sigma < (-1.5 - \mu)/\sigma\big) \\
&= P\big(Z > (1.5 - \mu)/\sigma\big) + P\big(Z < (-1.5 - \mu)/\sigma\big)
\end{aligned}
\]

\[
\begin{aligned}
K1C &= \min\big(-1.65 - (1 + \mu)/\sigma,\; (-1.5 - \mu)/\sigma\big) \\
K2C &= \max\big(1.65 + (1 - \mu)/\sigma,\; (1.5 - \mu)/\sigma\big) \\
PROPC &= P(Z > K2C) + P(Z < K1C)
\end{aligned}
\]

3. Now calculate PROPB and PROPA by subtraction: 

\[
\begin{aligned}
PROPB &= PROPBC - PROPC \\
PROPA &= 1 - PROPBC
\end{aligned}
\]
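
The three steps above translate directly into a short routine. The following is a minimal sketch, not the authors' program: the function names are hypothetical, the standard normal distribution function is built from `math.erfc`, and the inputs `mu` and `sigma` stand for the MH D-DIF estimate and its standard error.

```python
import math


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def dif_category_proportions(mu, sigma):
    """Return (PROPA, PROPB, PROPC) for an item with MH D-DIF ~ N(mu, sigma^2)."""
    # Step 1: B or C requires both a significant chi-square and |M| > 1
    k1bc = min(-1.96 - mu / sigma, (-1.0 - mu) / sigma)
    k2bc = max(1.96 - mu / sigma, (1.0 - mu) / sigma)
    prop_bc = (1.0 - norm_cdf(k2bc)) + norm_cdf(k1bc)

    # Step 2: C requires |M| > 1 + 1.65 * sigma and |M| > 1.5
    k1c = min(-1.65 - (1.0 + mu) / sigma, (-1.5 - mu) / sigma)
    k2c = max(1.65 + (1.0 - mu) / sigma, (1.5 - mu) / sigma)
    prop_c = (1.0 - norm_cdf(k2c)) + norm_cdf(k1c)

    # Step 3: remaining proportions by subtraction
    return 1.0 - prop_bc, prop_bc - prop_c, prop_c


# Illustrative inputs: MH D-DIF = 0.02 with SE = 0.60
prop_a, prop_b, prop_c = dif_category_proportions(0.02, 0.60)
print(round(prop_a, 3), round(prop_b, 3), round(prop_c, 3))
```

Taking the minimum of the two lower cut points and the maximum of the two upper cut points enforces both conditions of each rule at once, which is why a single pair of tail probabilities suffices for each step.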
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Table 1 

The 18 Administration Conditions 
                                  Sample size per item

Condition    Focal Population     Focal     Reference     Pool*

    1            N(-1,1)           100         900          1
    2            N(-1,1)           500         500          1
    3            N(-1,1)           100         900          2
    4            N(-1,1)           500         500          2
    5            N(-1,1)           100         900          3
    6            N(-1,1)           500         500          3
    7            N(0,1)            100         900          1
    8            N(0,1)            500         500          1
    9            N(0,1)            100         900          2
   10            N(0,1)            500         500          2
   11            N(0,1)            100         900          3
   12            N(0,1)            500         500          3
   13            N(.5,1)           100         900          1
   14            N(.5,1)           500         500          1
   15            N(.5,1)           100         900          2
   16            N(.5,1)           500         500          2
   17            N(.5,1)           100         900          3
   18            N(.5,1)           500         500          3

*Pool 1: no DIF. Pool 2: DIF uncorrelated with item difficulty. Pool 3: DIF positively correlated with item 
difficulty. 
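
The 18 conditions are the full crossing of three focal populations, three item pools, and two sample-size splits. As an illustration only (the dictionary keys and variable names are assumptions, not the report's), the numbering in Table 1 can be regenerated as follows:

```python
from itertools import product

focal_means = [-1.0, 0.0, 0.5]             # focal ability distributions N(mean, 1)
pools = [1, 2, 3]
sample_sizes = [(100, 900), (500, 500)]    # (focal, reference) examinees per item

# Focal population varies slowest and sample size fastest, matching Table 1
conditions = [
    {
        "condition": number,
        "focal_mean": mean,
        "n_focal": n_focal,
        "n_reference": n_reference,
        "pool": pool,
    }
    for number, (mean, pool, (n_focal, n_reference)) in enumerate(
        product(focal_means, pools, sample_sizes), start=1
    )
]

for row in conditions:
    print(row)
```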
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Table 2 

Means and Standard Deviations of ln a, b, and d for Item Pools 1, 2, and 3 
(Assuming a Multivariate Normal Distribution) 

                          Pool

                      1        2 and 3

ln a*     mean      -.15        -.15
          s.d.       .30         .30

b         mean        0           0
          s.d.       1.5         1.5

d         mean        0           0
          s.d.        0          .30

*A normal distribution for ln a with mean -.15 and s.d. .30 corresponds to a log-normal distribution for a with mean .9 and s.d. .28. 
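
The footnote's conversion follows from the standard log-normal moment formulas: if ln a is normal with mean m and standard deviation s, then a has mean exp(m + s^2/2) and standard deviation equal to that mean times sqrt(exp(s^2) - 1). A brief check, with illustrative variable names:

```python
import math

m, s = -0.15, 0.30   # mean and s.d. of ln a (Table 2)

mean_a = math.exp(m + s ** 2 / 2.0)
sd_a = mean_a * math.sqrt(math.exp(s ** 2) - 1.0)

print(round(mean_a, 2), round(sd_a, 2))
```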
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Table 3 

Correlation Matrices of ln a, b, and d for Item Pools 1, 2, and 3 

Pools 1 and 2 

          ln a      b       d

ln a        1      .40      0
b                   1       0
d                           1

Pool 3 

          ln a      b       d

ln a        1      .40      0
b                   1      .40
d                           1
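
Together with the standard deviations in Table 2, these correlations determine the covariance matrix of the assumed multivariate normal distribution. The sketch below builds that matrix for Pool 3; the helper name is hypothetical, and the code only illustrates the arithmetic rather than the procedure actually used to assemble the pools.

```python
def covariance_matrix(sds, corr):
    """Covariance entries are sd_i * sd_j * corr_ij."""
    n = len(sds)
    return [[sds[i] * sds[j] * corr[i][j] for j in range(n)] for i in range(n)]


# Order of variables: ln a, b, d
sds_pool3 = [0.30, 1.5, 0.30]          # standard deviations from Table 2
corr_pool3 = [
    [1.0, 0.40, 0.0],
    [0.40, 1.0, 0.40],
    [0.0, 0.40, 1.0],
]

cov_pool3 = covariance_matrix(sds_pool3, corr_pool3)
for row in cov_pool3:
    print([round(value, 3) for value in row])
```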
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Table 4 

Correlations for True DIF (ad) and MH D-DIF Statistics Based on Three Types of Matching Variables* 



Variables 


Type of 
Correlation 


Condition: 
Pool: 
Focal Group: 


4 

2 

NM,1) 


6 
3 

N(-l,l) 


10 
2 

N(O.l) 


12 

3 

N(O.l) 


16 
2 

N(.5.1) 


18 
3 

N(.5.1) 


Median 


A O AT 


Uncorrected 






.OO 




.00 


Q1 
.y I 


8Q 


.89 




Corrected 




.93 


1.00" 


1.00 


.97 


1.00" 


.99 


.99 


6 -CAT NR 


Uncorrected 




.05 


.0/ 


QQ 


.00 


on 


QO 


88 
•00 




Corrected 




.96 


.99 


.99 


.96 


1.00 


1.00' 


.99 


6-CAT £U2 


Uncorrected 




.yo 




Q8 


OA 


OQ 


Q^ 

.-7U 


•y\j 




Corrected 




.97 


.96 


.99 


.97 


1.00 


.97 


.97 


Q-/J INK 


Uncorrected 




QQ 


QQ 


QQ 


QQ 
.yy 


QQ 

. ^y 


99 


.99 




Corrected 




1.00" 


1.00" 


1.00" 


1.00" 


1.00" 


1.00" 


1.00* 


§-75 ad 


Uncorrected 




.84 


.86 


.88 


.85 


.90 


.88 


.87 




Corrected 




.93 


.97 


.98 


.93 


.99 


.98 


.97 


NR 


Uncorrected 




.86 


.87 


.88 


.84 


.89 


.89 


.88 




Corrected 




.95 


.98 


.98 


.92 


.99 


.98 


.98 



*In this table, θ-CAT, θ-75, and NR refer to the MH D-DIF statistics that result from matching on expected true 
score based on the CAT, expected true score based on 75 item responses, and number-right score based on 75 
items, respectively. Correlations are based on 71 items because 4 items were never administered in the CAT. 

^'Corrected value was greater than unity. 
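
A corrected correlation can exceed unity when an observed correlation is divided by a factor smaller than itself. The sketch below shows the classical correction for attenuation as one way this can happen; whether the report applied exactly this form is an assumption, the function name is hypothetical, and the reliability value is invented purely for illustration.

```python
import math


def correct_for_attenuation(r_xy, rel_x, rel_y=1.0):
    """Classical disattenuation: r divided by the root of the reliability product."""
    if not (0.0 < rel_x <= 1.0 and 0.0 < rel_y <= 1.0):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_xy / math.sqrt(rel_x * rel_y)


# Hypothetical values: observed r = .90, reliability of one measure = .80,
# the other measure (for example a true value) taken as perfectly reliable
corrected = correct_for_attenuation(0.90, rel_x=0.80)
print(round(corrected, 3))
```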
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Table 5 

Correlations for True DIF (ad) and STD P-DIF Statistics Based on Three Types of Matching Variables* 

                               Condition:     4        6        10       12       16       18
               Type of         Pool:          2        3        2        3        2        3
Variables      Correlation     Focal Group:   N(-1,1)  N(-1,1)  N(0,1)   N(0,1)   N(.5,1)  N(.5,1)  Median

θ-CAT  θ-75    Uncorrected                    .80      .80      .87      .88      .91      .86      .86
               Corrected                      .89      .91      .96      .97      1.00†    .95      .95

θ-CAT  NR      Uncorrected                    .81      .80      .88      .87      .91      .87      .87
               Corrected                      .93      .94      .99      .96      1.00     .95      .95

θ-CAT  ad      Uncorrected                    .96      .93      .98      .96      .99      .96      .96
               Corrected                      .97      .94      .99      .96      .99      .96      .97

θ-75   NR      Uncorrected                    .95      .95      .98      .98      .98      .99      .98
               Corrected                      1.00†    1.00†    1.00†    1.00†    1.00†    1.00†    1.00†

θ-75   ad      Uncorrected                    .82      .79      .87      .87      .91      .88      .87
               Corrected                      .90      .88      .95      .96      1.00     .97      .95

NR     ad      Uncorrected                    .83      .81      .88      .87      .90      .88      .87
               Corrected                      .94      .93      .98      .95      .99      .96      .95

*In this table, θ-CAT, θ-75, and NR refer to the STD P-DIF statistics that result from matching on expected true 
score based on the CAT, expected true score based on 75 item responses, and number-right score based on 75 
items, respectively. Correlations are based on 71 items because 4 items were never administered in the CAT. 

†Corrected value was greater than unity. 
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Table 6 

Means and Standard Deviations of DIF Statistics for Three Types of Matching Variables*† 

Matching    Number       Condition:     4        6        10       12       16       18
Variable    of Items‡    Pool:          2        3        2        3        2        3
                         Focal Group:   N(-1,1)  N(-1,1)  N(0,1)   N(0,1)   N(.5,1)  N(.5,1)  Median

                                              MH D-DIF

θ-CAT       71           mean           .00      .02      .02      .03      .01      .05      .02
                         s.d.           .96      .89      .99      .94      1.02     .96      .96

θ-75        71           mean           -.02     .01      -.01     .00      -.01     -.04     -.01
                         s.d.           .97      .90      .92      .96      1.02     .97      .97

NR          71           mean           -.02     -.02     -.02     -.02     -.02     -.08     -.02
                         s.d.           .97      .88      .93      .99      1.03     .97      .97

NR          75           mean           .01      .01      .01      .00      .02      -.04     .01
                         s.d.           .96      .87      .93      .98      1.02     .97      .97

                                              STD P-DIF x 10

θ-CAT       71           mean           .01      .02      .01      .02      .00      .02      .01
                         s.d.           .75      .72      .79      .76      .79      .77      .77

θ-75        71           mean           -.09     -.05     .03      .02      .03      .04      .02
                         s.d.           .66      .62      .62      .64      .62      .65      .63

NR          71           mean           -.07     -.05     .00      -.01     .02      .01      .00
                         s.d.           .71      .67      .64      .65      .62      .65      .64

NR          75           mean           -.04     -.02     .03      .02      .05      .04      .02
                         s.d.           .70      .66      .65      .65      .63      .65      .65

*θ-CAT, θ-75, and NR refer to the DIF methods that match on expected true score based on the CAT, expected true score based on 75 
item responses, and number-right score based on 75 items, respectively. Correlations are based on 71 items because 4 items were never 
administered in the CAT. 

†For conditions 4, 10, and 16, the mean value of ad across 71 items is -.004 and the standard deviation is .293. For conditions 6, 12, and 
18, the mean and standard deviation are -.001 and .293, respectively. 

‡This column gives the number of items on which the tabled means and standard deviations are based. 
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Table 7 
Pool 1: 

Average MH D-DIF for Each Value of b in the 500, 500 Sample Size Condition* 

Item
Difficulty   Value of ad†
(b)               0

-1.95            0.0
                (0.09)
                 21

-1.30            0.0
                (0.07)
                 30

-.65             0.0
                (0.06)
                 39

0                0.0
                (0.11)
                 39

.65              0.0
                (0.06)
                 33

1.30             0.0
                (0.09)
                 33

1.95             0.0
                (0.10)
                 18

Average          0.0
                (0.08)
                 213

*The first entry in each cell is the average MH D-DIF for the indicated values of ad and b, the second entry is the average standard error 
of the estimate, and the third entry is the number of item results over which the averages were computed. 

†ad = 0 for all items in Pool 1. 
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Table 8 



Pool 2: 

Average MH D-DIF for Each Combination of ad and b in the 500, 500 Sample Size Condition* 

Item                                              Value of ad
Difficulty
(b)        -.70     -.52     -.35     -.26     0        .26      .35      .52      .70      Average

-1.95                        -1.3     -0.9     0.0      1.0                                 -0.3
                             (0.07)   (0.08)   (0.10)   (0.10)                              (0.09)
                             3        6        9        3                                   21

-1.30                        -1.3     -0.9     0.0      0.9      1.3                        0.0
                             (0.05)   (0.08)   (0.06)   (0.09)   (0.06)                     (0.07)
                             3        6        12       6        3                          30

-.65                         -1.3     -0.9     0.0      0.8      1.2               2.4      0.0
                             (0.04)   (0.08)   (0.05)   (0.10)   (0.04)            (0.05)   (0.06)
                             6        6        15       6        3                 3        39

0          -2.1     -2.2     -1.2     0.1      0.0      0.9      1.2      1.8      2.5      0.1
           (0.04)   (0.30)†  (0.04)   (0.58)‡  (0.04)   (0.05)   (0.04)   (0.08)   (0.05)   (0.11)
           3        3        6        3        9        6        3        3        3        39

.65                          -1.2     -0.9     0.0      1.0      1.2                        0.0
                             (0.06)   (0.06)   (0.08)   (0.05)   (0.04)                     (0.07)
                             6        3        15       3        6                          33

1.30                -1.8     -0.9     -0.5     0.0      1.0      1.1                        -0.1
                    (0.10)   (0.06)   (0.16)   (0.10)   (0.11)   (0.06)                     (0.09)
                    3        6        3        12       3        6                          33

1.95                         -0.7              0.0      1.1      0.8                        0.3
                             (0.09)            (0.09)   (0.20)   (0.06)                     (0.10)
                             3                 6        3        6                          18

Average    -2.1     -2.0     -1.1     -0.7     0.0      0.9      1.1      1.8      2.5      0.0
           (0.04)   (0.20)   (0.05)   (0.14)   (0.07)   (0.09)   (0.05)   (0.08)   (0.05)   (0.08)
           3        6        33       27       78       30       27       3        6        213

*The first entry in each cell is the average MH D-DIF for the indicated values of ad and b, the second entry is the average standard error 
of the estimate, and the third entry is the number of item results over which the averages were computed. 

†The average standard error is large because of the sparsity of data for Item 19. 

‡The average standard error is large because of the sparsity of data for Item 20. 
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Table 9 



Pool 3: 

Average MH D-DIF for Each Combination of ad and b in the 500, 500 Sample Size Condition* 

Item                                              Value of ad
Difficulty
(b)        -.70     -.52     -.35     -.26     0        .26      .35      .52      .70      Average

-1.95               -1.6     -0.9     -0.7     0.3      1.3                                 -0.3
                    (0.08)   (0.07)   (0.09)   (0.09)   (0.09)                              (0.09)
                    3        3        6        6        3                                   21

-1.30               -1.4     -0.9     -0.6     0.3      1.2      1.6                        0.0
                    (0.07)   (0.05)   (0.07)   (0.06)   (0.08)   (0.06)                     (0.06)
                    3        3        6        12       3        3                          30

-.65       -2.2              -1.0     -0.7     0.2      1.0      1.5                        -0.1
           (0.04)            (0.04)   (0.08)   (0.05)   (0.09)   (0.04)                     (0.06)
           3                 6        6        15       6        3                          39

0                            -1.1     -0.6     0.1      1.0      1.3                        0.1
                             (0.04)   (0.46)†  (0.04)   (0.06)   (0.04)                     (0.11)
                             6        6        15       6        6                          39

.65                          -1.4     -1.0     -0.1     0.8      1.1                        -0.1
                             (0.06)   (0.06)   (0.08)   (0.05)   (0.05)                     (0.07)
                             6        3        15       3        6                          33

1.30                         -1.1     -1.3     -0.3     0.6      0.9      1.6      2.1      0.2
                             (0.05)   (0.09)   (0.11)   (0.08)   (0.05)   (0.11)   (0.08)   (0.09)
                             3        3        12       3        6        3        3        33

1.95                         -1.1              -0.5     0.6      0.0               1.5      0.3
                             (0.08)            (0.11)   (0.17)   (0.06)            (0.06)   (0.09)
                             3                 3        3        6                 3        18

Average    -2.2     -1.5     -1.1     -0.7     0.0      0.9      1.1      1.6      1.8      0.0
           (0.04)   (0.08)   (0.05)   (0.15)   (0.07)   (0.09)   (0.05)   (0.11)   (0.07)   (0.08)
           3        6        30       30       78       27       30       3        6        213

*The first entry in each cell is the average MH D-DIF for the indicated values of ad and b, the second entry is the average standard error 
of the estimate, and the third entry is the number of item results over which the averages were computed. 

†The average standard error is large because of the sparsity of data for Item 20. 
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Table 10 
Pool 1: 

Average STD P-DIF x 10 for Each Value of ^ in the 500, 500 Sample Size Condition* 



Item Difficulty    Value of ad
(b)                0

-1.95               0.00  (0.07)   21
-1.30              -0.01  (0.06)   30
-.65                0.00  (0.05)   39
0                  -0.01  (0.08)   39
.65                 0.01  (0.06)   33
1.30                0.01  (0.08)   33
1.95                0.02  (0.09)   18

Average             0.00  (0.07)  213



*The first entry in each cell is the average STD P-DIF, multiplied by 10, for the indicated values of ad and b, the second entry is the 
average standard error of the estimate, multiplied by 10, and the third entry is the number of item results over which the averages were 
computed. 

^ad = 0 for all items in Pool 1. 
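
For readers who want to see how a STD P-DIF value of the kind tabled here arises, the following is a minimal sketch of the standardization index in its usual form: a weighted average, over matching levels, of the difference between the focal and reference group proportions correct, with focal group counts as weights. The function name, variable names, and toy data are assumptions for illustration only. The CAT-based version studied in this report matches examinees on expected true scores rather than on number-right scores, and that matching step is not shown.

```python
def std_p_dif(levels):
    """Standardized p-difference.

    levels: list of (n_focal, p_focal, p_reference) tuples, one per
    matching level, where the p values are proportions correct.
    """
    total = sum(n for n, _, _ in levels)
    if total == 0:
        raise ValueError("no focal group examinees")
    # Focal-group-weighted mean of (focal - reference) proportion correct.
    return sum(n * (pf - pr) for n, pf, pr in levels) / total


# Toy data (hypothetical): two matching levels.
toy_levels = [(20, 0.40, 0.50), (30, 0.60, 0.70)]
value = std_p_dif(toy_levels)
print(round(10 * value, 2))  # multiplied by 10, as in the tables
```

Negative values indicate that focal group members answered the item correctly less often than matched reference group members.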
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Table 11 



Pool 2:

Average STD P-DIF x 10 for Each Combination of ad and b in the 500, 500 Sample Size Condition*

Item Difficulty (b)    Value of ad:  -.70  -.52  -.35  -.26  0  .26  .35  .52  .70  Average


-1.95 






-0.70 
(0.04) 
3 


-0.74 
(0.07) 
6 


0.04 
(0.08) 
9 


0.72 
(0.07) 
3 








-0.19 

(0.07) 

21 


-1.30 






-0.92 
(0.04) 
3 


-0.83 
(0.07) 
6 


0.02 
(0.05) 
12 


0.81 
(0.08) 
6 


0.84 
(0.04) 
3 






0.00 
(0.06) 
30 


-.65 






-0.98 
(0.03) 
6 


-0.78 
(0.07) 
6 


-0.04 

(0.05) 

15 


0.64 
(0.08) 
6 


0.89 
(0.04) 
3 




1.70 
(0.03) 
3 


0.01 
(0.05) 
39 


0 


-1.83 
(0.04) 
3 


-1.27 

(0.20)" 
3. 


-0.91 
(0.03) 
6 


0.08 
(0.36)' 
3 


-0.03 
(0.04) 
9 


0.76 
(0.04) 
6 


0.81 
(0.03) 
3 


1.49 
(0.06) 
3 


2.02 
(0.04) 
3 


0.07 
(0.08) 
39 


.65 






-1.00 
(0.05) 
6 


-0.75 
(0.05) 
3 


-0.01 

(0.07) 

15 


0.80 
(0.04) 
3 


0.97 
(0.04) 
6 






0.00 
(0.06) 
33 


1.30 




-1.66 
(0.09) 
3 


-0.72 
(0.05) 
6 


-0.41 
(0.14) 
3


-0.02
(0.09)
12


0.82
(0.10)
3


0.90
(0.06)
6






-0.09 

(0.08) 

33 


1.95 






-0.58 
(0.07) 
3 




0.00 
(0.08) 
6 


0.97 
(0.18) 
■ 3 


0.63 
(0.05) 
6 






0.28 
(0.08) 
18 


Average 


-1.83 
(0.04) 
3 


-1.46 
(0.15) 
6 


-0.85 

(0.04) 

33 


-0.64 

(0.11) 

27 


-0.01 

(0.06) 

78 


0.77
(0.08) 
30 


0.84 
(0.04) 
27 


1.49 
(0.06) 
3 


1.86 
(0.04) 
6 


0.01 
(0.07) 
213 



*The first entry in each cell is the average STD P-DIF, multiplied by 10, for the indicated values of ad and b, the second entry is the
average standard error of the estimate, multiplied by 10, and the third entry is the number of item results over which the averages were
computed.

The average standard error is large because of the sparsity of data for Item 19. 
The average standard error is large because of the sparsity of data for Item 2ti. 
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Table 12 



Pool 3:

Average STD P-DIF x 10 for Each Combination of ad and b in the 500, 500 Sample Size Condition*

Item Difficulty (b)    Value of ad:  -.70  -.52  -.35  -.26  0  .26  .35  .52  .70  Average


-1.95 




-1.35 
(0.07) 
3 


-0.49 
(0.04) 
3 


-0.55 
(0.08) 
6 


0.21 
(0.06) 
6 


0.98 
(0.07) 
3 








-0.22
(0.07) 
21 


-1.30 




-1.28 

(0.07) 
3 


-0.66 
(0.04) 
3 


-0.54 

(0.06) 

6 


0.22 
(0.05) 
12 


1.09 
(0.08) 
3 


1.04 
(0.04) 
3 






0.00 

(0.06) 

30 


-.65 


-1.74 

(0.03) 
3 




-0.78 
(0.04) 
6 


-0.62 
(0.07) 
6 


0.18 
(0.05) 
15 


0.85 
(0.08) 
6 


1.12 
(0.04) 
3 






-0.06 

(0.05) 
39 


0 






-0.87 
(0.03) 
6 


-0.38 
(0.32)" 
6 


0.07 
(0.04) 
15 


0.84 
(0.06) 
6 


1.05 
(0.04) 
6 






0.13 
(0.08) 
39 


.65 






-1.17 
(0.05) 
6 


-0.85 
(0.05) 
3 


-0.09 

(0.07) 

15 


0.65 
(0.04) 
3 


0.91 
(0.04) 
6 






-0.11 

(0.06) 

33 


1.30 






-0.88 
(0.04) 
3 


-1.13 
(0.09) 
3 


-0.22 

(0.09) 

12 


0.49 
(0.08) 
3 


0.71 
(0.04) 
6 


1.25 
(0.09) 
3 


1.60 
(0.07) 
3 


0.17 
(0.07) 
33 


1.95 






-0.87 
(0.07) 
3 




-0.46 
(0.10) 
3 


0.60 
(0.16) 
3 


0.45 
(0.05) 
6 




1.20 
(0.05) 
3 


0.23 
(0.08) 
18 


Average 


-1.74 
(0.03) 
3 


-1.31 
(0.07) 
6 


-0.85 

(0.04) 

30 


-0.62 

(0.12) 

30 


0.03 
(0.06) 
78 


0.80 
(0.08) 
27 


0.84 
(0.04) 
30 


1.25 
(0.09) 
3 


1.40 
(0.06) 
6 


0.02 
(0.07) 
213 



*The first entry in each cell is the average STD P-DIF, multiplied by 10, for the indicated values of ad and b, the second entry is the
average standard error of the estimate, multiplied by 10, and the third entry is the number of item results over which the averages were
computed.

"The average standard error is large because of the sparsity of data for Item 20.
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Table 13 
Pool 1: 

Average Expected Percent of C Results for Each Value of b*



Item Difficulty    Value of ad
(b)                0

-1.95              0.0   0.2    21
-1.30              0.0   0.1    30
-.65               0.0   0.1    39
0                  0.0   0.1    39
.65                0.0   0.1    33
1.30               0.0   0.1    33
1.95               0.0   0.1    18

Average            0.0   0.1   213



*The first entry in each cell is the average expected percent of C results for the indicated values of ad and b in the 900, 100 sample size
condition, the second entry is the average percent for the 500, 500 sample size condition, and the third entry is the number of item
results over which the averages were computed.

ad = 0 for all items in Pool 1.
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Table 14 



Pool 2:

Average Expected Percent of C Results for Each Combination of ad and b*

Item Difficulty (b)    Value of ad:  -.70  -.52  -.35  -.26  0  .26  .35  .52  .70  Average


-1.95 






10.3


3.6 


0.2 


4.2 








3.2 








15.1 


3.3 


0.0 


3.9 








3.6 








3 


6 


9 


3 








21 


-1.30 






11.9 


3.4 


0.1 


3.4 


11.1 






3.7 








19.5 


2.9 


0.0 


2.7 


18.7 






4.9 








3 


6 


12 


6 


3 






30 


-.65 






11.4 


3.5 


0.1 


2.2 


9.8 




67.2


8.6 








18.7 


3.2 


0.0 


1.4 


15.5 




97.3 


12.3 








6 


6 


15 


6 


3 




3 


39 


0 


60.5 


49.6 


8.2 


0.5 


0.1 


3.7 


7.8 


38.3 


78.3 


19.9 




93.9 


83.0 


11.6 


0.1 


0.0 


3.1 


11.6 


75.3 


99.5 


30.2




3 


3 


6 


3 


9 


6 


3 


3 


3 


39 


.65 






8.5 


2.7 


0.1 


4.3 


8.7 






3.8 








12.0 


1.8 


0.0 


4.0 


12.8 






5.0 








6 


3 


15 


3 


6 






33 


1.30 




44.8 


4.1 


0.8 


0.1 


5.2 


8.1 






6.9 






81.4 


4.0 


0.2 


0.0 


5.7 


11.5 






10.8 






3 


6 


3 


12 


3 


6 






33 


1.95 






2.0 




0.1 


7.5 


3.3 






2.7 








1.2 




0.0 


11.3 


2.8 






3.0 








3 




6 


3 


6 






18 


Average 


60.5 
93.9 
3 


47.2 
82.2 
6 


8.1 
11.7 
33 


2.8 
2.3 
27 


0.1 
0.0 
78 


4.0 
3.9 
30 


7.7 
11.1 

27 


38.3 
75.3 
3 


72.7 
98.4 
6 


7.9 
11.5 
213 



*The first entry in each cell is the average expected percent of C results for the indicated values of ad and b in the 900, 100 sample size
condition, the second entry is the average percent for the 500, 500 sample size condition, and the third entry is the number of item
results over which the averages were computed.
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Table 15 



Pool 3:

Average Expected Percent of C Results for Each Combination of ad and b*

Item Difficulty (b)    Value of ad:  -.70  -.52  -.35  -.26  0  .26  .35  .52  .70  Average


-1.95 




26.8 


3.2 


1.2 


0.4 


12.5 








6.6 






52.1


2.5 


0.5 


0.1 


21.9 








11.1 






3 


3 


6 


6 


3 








21 


-1.30 




17.6 


3.6 


0.9 


0.3 


10.0 


22.0 






5.6 






34.1 


3.1 


0.2 


0.0 


15.3 


45.0 






9.8 






3 


3 


6 


12 


3 


3 






30 


-.65 


63.8 




4.2 


1.7 


0.1 


5.5 


18.8 






8.2




95.2 




4.0 


0.9 


0.0 


6.2 


37.8 






11.9 




3 




6 


6 


15 


6 


3 






39 


0 






6.4 


2.6 


0.1 


5.0 


14.2 






4.4 








8.0 


2.6 


0.0 


5.3 


26.0 






6.4 








6 


6 


15 


6 


6 






39 


.65 






15.6 


4.6 


0.1 


2.3 


7.0 






4.8 








27.7 


4.4 


0.0 


1.2 


9.1 






7.2 








6 


3 


15 


3 


6 






33 


1.30 






8.6 


12.1 


0.3 


0.8 


3.6 


26.3 


54.7 


10.1 






12.0 


19.4 


0.1 


0.2 


3.2


55.1 


92.4 


16.9 








3


3


12


3


6

3 


3 


33 


1.95 






8.4 




0.7 


1.2 


1.0 




27.1 


6.6 






11.8 




0.1 


0.4 


0.3 




48.1 


10.2








3 




3 


3 


6 




3 


18 


Average 


63.8 
95.2 
3 


22.2 
43.1 
6 


7.6 
10.9 
30 


3.0 
3.2 
30 


0.2 
0.0 
78 


5.3 
6.9 
27 


9.3 
16.0 
30 


26.3 
55.1 
3 


40.9 
70.3 
6 


6.6 
10.4 
213 



*The first entry in each cell is the average expected percent of C results for the indicated values of ad and b in the 900, 100 sample size
condition, the second entry is the average percent for the 500, 500 sample size condition, and the third entry is the number of item
results over which the averages were computed.
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Table 16 
Pretest Items (Pool 1): 

Average MH D-DIF for Each Combination of ad and b in the 500, 500 Sample Size Condition* 



Item Difficulty               Value of ad
(b)            -.70     -.35        0      .35      .70  Average

-1.30          -2.4     -1.3      0.0      1.2      2.5      0.0
             (0.04)   (0.04)   (0.04)   (0.05)   (0.05)   (0.05)

0              -1.9     -1.0      0.0      1.1       11      0.1
             (0.03)   (0.03)   (0.03)   (0.03)   (0.04)   (0.03)

1.30           -1.0     -0.6      0.0      0.7      1.6      0.1
             (0.03)   (0.03)   (0.03)   (0.03)   (0.03)   (0.03)

Average        -1.8     -0.9      0.0      1.0      2.1      0.1
             (0.04)   (0.04)   (0.04)   (0.04)   (0.04)   (0.04)



*The first entry in each cell is the average MH D-DIF for the indicated values of ad and b and the second entry is the average standard
error of the estimate. Each cell average is based on 3 item results.
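
As a reminder of what an MH D-DIF value represents, the sketch below computes the statistic in its standard form from a set of 2 x 2 tables, one per matching level: the Mantel-Haenszel common odds ratio estimate is formed and then rescaled as -2.35 times its natural logarithm. The function name, argument layout, and toy counts are assumptions for illustration. The modified procedure examined in this report forms the matching levels from expected true scores based on the CAT responses, and that step is not reproduced here.

```python
import math


def mh_d_dif(tables):
    """MH D-DIF from per-level 2 x 2 tables.

    tables: list of (ref_right, ref_wrong, focal_right, focal_wrong)
    counts, one tuple per matching level.
    """
    num = 0.0
    den = 0.0
    for a, b, c, d in tables:
        t = a + b + c + d
        if t == 0:
            continue  # skip empty matching levels
        num += a * d / t
        den += b * c / t
    if num <= 0 or den <= 0:
        raise ValueError("odds ratio is undefined for these tables")
    alpha_mh = num / den  # Mantel-Haenszel common odds ratio estimate
    return -2.35 * math.log(alpha_mh)


# Toy counts (hypothetical): the item is harder for the focal group.
toy_tables = [(40, 10, 30, 20), (45, 5, 40, 10)]
print(round(mh_d_dif(toy_tables), 2))
```

On this scale a value of zero corresponds to no DIF, and negative values indicate an item that is more difficult for focal group members than for matched reference group members.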
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Table 17 
Pretest Items (Pool 1): 

Average STD P-DIF x 10 for Each Combination of ad and b in the 500, 500 Sample Size Condition*



Item Difficulty               Value of ad
(b)            -.70     -.35        0      .35      .70  Average

-1.30         -1.3?    -0.65    -0.03     0.47     0.90    -0.13
             (0.03)   (0.02)   (0.02)   (0.02)   (0.02)   (0.02)

0             -1.41    -0.71    -0.01     0.75     1.47     0.02
             (0.03)   (0.03)   (0.03)   (0.03)   (0.03)   (0.03)

1.30          -0.73    -0.40     0.05     0.54     1.20     0.13
             (0.03)   (0.03)   (0.03)   (0.03)   (0.03)   (0.03)

Average       -1.16    -0.59     0.00     0.58     1.20     0.01
             (0.03)   (0.03)   (0.03)   (0.03)   (0.02)   (0.03)



*The first entry in each cell is the average STD P-DIF, multiplied by 10, for the indicated values of ad and b and the second entry is the
average standard error of the estimate, multiplied by 10. Each cell average is based on 3 item results.
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Table 18 
Pretest Items (Pool 1): 
Average Expected Percent of C Results for Each Combination of ad and b*



Item Difficulty               Value of ad
(b)            -.70     -.35        0      .35      .70  Average

-1.30          65.5     10.0      0.4      7.4     41.9     25.0
               93.5     15.5      0.0     10.5     82.0     40.3

0              41.4      4.3      0.1      6.3     53.8     21.2
               79.6      4.6      0.0      8.1     88.9     36.2

1.30            8.0      1.1      0.1      1.9     28.3      7.9
                9.8      0.3      0.0      0.9     49.2     12.1

Average        38.3      5.1      0.2      5.2     41.4     18.0
               61.0      6.8      0.0      6.5     73.4     29.5



*The first entry in each cell is the average expected percent of C results for the indicated values of ad and b in the 900, 100 sample size
condition and the second entry is the average percent for the 500, 500 sample size condition. Each cell average is based on 3 item
results.
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Table 19 
Pretest Items (Pool 2): 
Average Expected Percent of C Results for Each Combination of ad and b*



Item Difficulty               Value of ad
(b)            -.70     -.35        0      .35      .70  Average

-1.30          65.3      9.6      0.4      7.4     41.8     24.9
               93.7     14.9      0.0     10.3     82.2     40.2

0              43.4      5.1      0.1      6.5     53.4     21.7
               81.4      6.3      0.0      8.5     89.2     37.1

1.30            7.7      1.0      0.1      1.8     27.6      7.6
                9.6      0.3      0.0      0.9     47.2     11.6

Average        38.8      5.2      0.2      5.2     40.9     18.1
               61.6      7.1      0.0      6.5     72.8     29.6



*The first entry in each cell is the average expected percent of C results for the indicated values of ad and b in the 900, 100 sample size
condition and the second entry is the average percent for the 500, 500 sample size condition. Each cell average is based on 3 item
results.
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Table 20 
Pretest Items (Pool 3):

Average Expected Percent of C Results for Each Combination of ad and b* 



Item Difficulty               Value of ad
(b)            -.70     -.35        0      .35      .70  Average

-1.30          53.6      5.2      0.6     11.5     52.0     24.6
               87.2      6.5      0.1     19.4     91.2     40.9

0              40.4      3.4      0.1      9.0     61.5     22.9
               78.8      3.2      0.0     14.3     94.2     38.1

1.30           10.5      1.5      0.1      1.4     24.9      7.7
               13.4      0.5      0.0      0.6     45.9     12.1

Average        34.8      3.3      0.3      7.3     46.1     18.4
               59.8      3.4      0.0     11.5     77.1     30.3



*The first entry in each cell is the average expected percent of C results for the indicated values of ad and b in the 900, 100 sample size
condition and the second entry is the average percent for the 500, 500 sample size condition. Each cell average is based on 3 item
results.
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Table 21 

Median and Interquartile Range of Examinee Residuals - e) for Samples of 1,000*

Group              Pool 1    Pool 2    Pool 3

Reference          -0.032
                    0.450

Focal: N(-1,1)     -0.038    -0.032    -0.099
                    0.515     0.523     0.569

Focal: N(0,1)      -0.031    -0.014    -0.044
                    0.458     0.487     0.502

Focal: N(0.5,1)    -0.036    -0.048    -0.011
                    0.445     0.472     0.488

*Standard errors of medians are approximately 0.02.
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Table 22 

Median and Interquartile Range of Examinee Residuals (g^^ - e) for Samples of 1,000*

Group              Pool 1    Pool 2    Pool 3

Reference          -0.030
                    0.379

Focal: N(-1,1)     -0.006    -0.003    -0.063
                    0.421     0.425     0.413

Focal: N(0,1)      -0.025    -0.009    -0.037
                    0.333     0.337     0.391

Focal: N(0.5,1)    -0.028    -0.026    -0.013
                    0.382     0.331     0.360

*Standard errors of medians are approximately 0.01.



